Introduction


This book is primarily based on a Machine Learning subset known as Reinforcement 
Learning. We cover the basics of Reinforcement Learning with the help of the Python
programming language and touch on several aspects, such as Q learning, MDP, RL with 
Keras, and OpenAI Gym and OpenAI Environment, and also cover algorithms related 
to RL. 

Users need a basic understanding of programming in Python to benefit from this 
book. 

The book is meant for people who want to get into Machine Learning and learn more 
about Reinforcement Learning. 





CHAPTER 1 


Reinforcement Learning 
Basics


This chapter is a brief introduction to Reinforcement Learning (RL) and includes some 
key concepts associated with it. 

In this chapter, we talk about Reinforcement Learning as a core concept and then
define it further. We show a complete flow of how Reinforcement Learning works. We
discuss exactly where Reinforcement Learning fits into artificial intelligence (AI). After
that we define key terms related to Reinforcement Learning. We start with agents,
then touch on environments, and finally talk about the connection between agents
and environments.


What Is Reinforcement Learning? 

We use Machine Learning to constantly improve the performance of machines or
programs over time. A simplified way of implementing a process that improves
machine performance with time is using Reinforcement Learning (RL). Reinforcement
Learning is an approach through which intelligent programs, known as agents, work
in a known or unknown environment to constantly adapt and learn based on feedback
given as points. The feedback might be positive, also known as rewards, or negative, also
called punishments. Considering the interaction between the agents and the environment,
we then determine which action to take.

In a nutshell, Reinforcement Learning is based on rewards and punishments.

Some important points about Reinforcement Learning:

• It differs from normal Machine Learning, as we do not look at
training datasets.

• Interaction happens not with data but with environments,
through which we depict real-world scenarios.


© Abhishek Nandy and Manisha Biswas 2018 
A. Nandy and M. Biswas, Reinforcement Learning, 
https://doi.org/10.1007/978-1-4842-3285-9_1


• As Reinforcement Learning is based on environments, many
parameters come into play. It takes lots of information to learn
and act accordingly.

• Environments in Reinforcement Learning are real-world 
scenarios that might be 2D or 3D simulated worlds or game- 
based scenarios. 

• Reinforcement Learning is broader in a sense because the 
environments can be large in scale and there might be a lot of 
factors associated with them. 

• The objective of Reinforcement Learning is to reach a goal. 

• Rewards in Reinforcement Learning are obtained from the 
environment. 

The Reinforcement Learning cycle is depicted in Figure 1-1 with the help of a robot. 



Figure 1-1. Reinforcement Learning cycle
A maze is a good example that can be studied using Reinforcement Learning, in
order to determine the exact right moves to complete the maze (see Figure 1-2).



Figure 1-2. Reinforcement Learning can be applied to mazes


In Figure 1-3, we are applying Reinforcement Learning and we call it the
Reinforcement Learning box because within its vicinity the process of RL works. RL starts
with intelligent programs, known as agents, and when they interact with environments,
there are rewards and punishments associated. An environment can be either known
or unknown to the agents. The agents take actions to move to the next state in order to
maximize rewards.
Figure 1-3. Reinforcement Learning flow (actions are performed to maximize rewards)

In the maze, the central concept is to keep moving. The goal is to clear the maze
and reach the end as quickly as possible.

The following concepts of Reinforcement Learning and the working scenario are
discussed later in this chapter.

• The agent is the intelligent program 

• The environment is the maze 

• The state is the place in the maze where the agent is 

• The action is the move we take to move to the next state 

• The reward is the points associated with reaching a particular 
state. It can be positive, negative, or zero 

We use the maze example to apply concepts of Reinforcement Learning. We will be 
describing the following steps: 

1. The concept of the maze is given to the agent. 

2. There is a task associated with the agent and Reinforcement 
Learning is applied to it. 

3. The agent receives a -1 reinforcement for every move it
makes from one state to another.

4. There is a reward system in place for the agent when it moves 
from one state to another. 


The rewards predictions are made iteratively, where we update the value of each 
state in a maze based on the value of the best subsequent state and the immediate reward 
obtained. This is called the update rule. 
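
The update rule can be sketched on a toy problem. The example below is only an illustration under stated assumptions: the "maze" is a hypothetical corridor of five cells with the exit at the last cell, every move costs one point, and no discounting is applied. The names `N_CELLS`, `GOAL`, `neighbors`, and `update_values` are invented for this sketch.

```python
# Hypothetical 1-D "maze": cells 0..4 in a corridor, cell 4 is the exit.
N_CELLS = 5
GOAL = 4
STEP_REWARD = -1  # assumption: every move costs one point


def neighbors(cell):
    """Cells reachable in one move (left or right, staying inside the corridor)."""
    return [c for c in (cell - 1, cell + 1) if 0 <= c < N_CELLS]


def update_values(values):
    """Apply the update rule once to every non-goal cell.

    New value = immediate reward + value of the best subsequent state.
    """
    new_values = list(values)
    for cell in range(N_CELLS):
        if cell == GOAL:
            continue  # the exit keeps value 0: no further moves are needed
        new_values[cell] = max(STEP_REWARD + values[nxt] for nxt in neighbors(cell))
    return new_values


values = [0.0] * N_CELLS
for _ in range(10):  # iterate; the values stop changing after a few sweeps
    values = update_values(values)

print(values)  # [-4.0, -3.0, -2.0, -1.0, 0.0]
```

Each value ends up as the negated number of moves still needed to reach the exit, so following the neighbor with the highest value traces the shortest path.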

The constant movement of the Reinforcement Learning process is based on 
decision-making. 

Reinforcement Learning works on a trial-and-error basis because it is very difficult to
predict which action to take when the agent is in a given state. From the maze problem
itself, you can see that in order to get the optimal path for the next move, you have to
weigh a lot of factors. The choice is always made on the basis of state, action, and rewards.
For the maze, we have to compute and account for the probability of taking each step.

The maze also does not consider the reward of the previous step; it is specifically 
considering the move to the next state. The concept is the same for all Reinforcement 
Learning processes. 

Here are the steps of this process: 

1. We have a problem. 

2. We have to apply Reinforcement Learning. 

3. We consider applying Reinforcement Learning as a 
Reinforcement Learning box. 

4. The Reinforcement Learning box contains all essential 
components needed for applying the Reinforcement Learning 
process. 

5. The Reinforcement Learning box contains agents, 
environments, rewards, punishments, and actions. 

Reinforcement Learning works well with intelligent program agents that receive rewards
and punishments when interacting with an environment.

The interaction happens between the agents and the environments, as shown in 
Figure 1-4. 



Figure 1-4. Interaction between agents and environments


From Figure 1-4, you can see that there is a direct interaction between the agents and
their environments. This interaction is very important because through these exchanges,
the agent adapts to the environments. When a Machine Learning program, robot, or
Reinforcement Learning program starts working, the agents are exposed to known or
unknown environments and the Reinforcement Learning technique allows the agents to
interact and adapt according to the environment's features.

Accordingly, the agents work and the Reinforcement Learning robot learns. In order 
to get to a desired position, we assign rewards and punishments. 
Now, the program has to work out the optimal path to get maximum rewards, even if
it fails along the way (that is, it takes punishments or receives negative points). In order
to reach a new position, which is also known as a state, it must perform what we call an action.

To perform an action, we implement a function, also known as a policy. A policy is
therefore a function that maps the current state to the action to take.
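
As a minimal sketch of a policy as a function, the example below assumes a hypothetical corridor with states 0 to 4 and two actions, "left" and "right". The names `go_right_policy` and `step` are invented for illustration and are not taken from the book's code.

```python
def go_right_policy(state):
    """A policy: given the current state, return the action to take."""
    return "right"  # assumption: this toy policy ignores the state


def step(state, action):
    """Illustrative state transition for a corridor of five cells (0..4)."""
    move = 1 if action == "right" else -1
    return min(max(state + move, 0), 4)


state = 0
path = [state]
while state != 4:  # cell 4 is the assumed goal state
    state = step(state, go_right_policy(state))
    path.append(state)

print(path)  # [0, 1, 2, 3, 4]
```

A learned policy would replace the fixed rule in `go_right_policy` with a choice based on what the agent has experienced so far.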


Faces of Reinforcement Learning 

As you see from the Venn diagram in Figure 1-5, Reinforcement Learning sits at the
intersection of many different fields of science.



Figure 1-5. All the faces of Reinforcement Learning (fields: computer science, engineering, mathematics, neuroscience, psychology, economics; overlapping topics: machine learning, optimal control, reward system, operations research, classical/operant conditioning, bounded rationality)
The intersection points reveal a very strong feature of Reinforcement Learning: it
shows the science of decision-making. If we have two paths and have to decide which
path to take so that some goal is met, a scientific decision-making process can be
designed.

Reinforcement Learning is the fundamental science of optimal decision-making.

If we focus on the computer science part of the Venn diagram in Figure 1-5, we
see that if we want to learn, it falls under the category of Machine Learning, which is
specifically mapped to Reinforcement Learning.

Reinforcement Learning can be applied to many different fields of science. In
engineering, we have devices that focus mostly on optimal control. In neuroscience, we
are concerned with how the brain makes decisions and study
the reward system that works on the brain (the dopamine system).

Psychologists can apply Reinforcement Learning to determine how animals make
decisions. In mathematics, Reinforcement Learning is applied extensively in
operations research.


The Flow of Reinforcement Learning 

Figure 1-6 connects agents and environments. 


Figure 1-6. RL structure (state, action)


The interaction happens from one state to another. The exact connection starts 
between an agent and the environment. Rewards are happening on a regular basis. 
We take appropriate actions to move from one state to another. 

The key points of consideration after going through the details are the following: 

• The Reinforcement Learning cycle works in an interconnected
manner.

• There is distinct communication between the agent and the
environment.

• The distinct communication happens with rewards in mind.

• The object or robot moves from one state to another.

• An action is taken to move from one state to another.
Figure 1-7 simplifies the interaction process.




Figure 1-7. The entire interaction process


An agent is always learning and finally makes a decision. An agent is a learner, which 
means there might be different paths. When the agent starts training, it starts to adapt and 
intelligently learns from its surroundings. 

The agent is also a decision maker because it tries to take an action that will get it the 
maximum reward. 

When the agent starts interacting with the environment, it can choose an action and 
respond accordingly. 

From then on, new scenes are created. When the agent changes from one place to 
another in an environment, every change results in some kind of modification. These 
changes are depicted as scenes. The transition that happens in each step helps the agent 
solve the Reinforcement Learning problem more effectively. 
Let's look at another scenario of state transitioning, as shown in Figures 1-8 and 1-9.



Figure 1-8. Scenario of state changes



S0 → S1 → S2

Figure 1-9. The state transition process


Learn to choose actions that maximize the following:

$r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$

At each state transition, the reward is a different value, hence we describe the reward
with varying values in each step, such as $r_0$, $r_1$, $r_2$, etc. Gamma ($\gamma$) is called a discount
factor and it determines what types of future reward we get:

• A gamma value of 0 means the reward is associated with the
current state only

• A gamma value of 1 means that the reward is long-term
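
To make the effect of the discount factor concrete, here is a small sketch that evaluates the discounted sum for a finite list of rewards. The function name `discounted_return` and the reward values are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Return r0 + gamma*r1 + gamma**2*r2 + ... for a finite reward list."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [1.0, 1.0, 1.0, 1.0]  # hypothetical rewards r0..r3

print(discounted_return(rewards, 0.0))  # 1.0   (only the immediate reward counts)
print(discounted_return(rewards, 0.5))  # 1.875 (later rewards count less)
print(discounted_return(rewards, 1.0))  # 4.0   (all rewards count equally)
```

The two extreme values reproduce the bullet points above: with gamma set to 0 only the current reward matters, and as gamma approaches 1 the long-term rewards dominate.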

Different Terms in Reinforcement Learning 

Now we cover some common terms associated with Reinforcement Learning. 

There are two constants that are important in this case: gamma (γ) and lambda (λ),
as shown in Figure 1-10.

Figure 1-10. Showing values of constants

Gamma is common in Reinforcement Learning problems but lambda is used 
generally in terms of temporal difference problems. 


Gamma 

Gamma is used in each state transition and is a constant value at each state change. 
Gamma allows you to give information about the type of reward you will be getting in 
every state. Generally, the values determine whether we are looking for reward values in 
each state only (in which case, it's 0) or if we are looking for long-term reward values (in 
which case it's 1). 


Lambda 

Lambda is generally used when we are dealing with temporal difference problems. It is
more involved with predictions in successive states.

Increasing values of lambda in each state show that our algorithm is learning fast.
The faster algorithm yields better results when using Reinforcement Learning techniques.

As you'll learn later, temporal differences can be generalized to what we call
TD(Lambda). We discuss it in greater depth later.

Interactions with Reinforcement Learning 

Let's now talk about Reinforcement Learning and its interactions. As shown in
Figure 1-11, the interactions between the agent and the environment occur with a reward.
We need to take an action to move from one state to another.
Figure 1-11. Reinforcement Learning interactions (agent, environment, state, reward, action)


Reinforcement Learning is a way of implementing how to map situations to actions 
so as to maximize and find a way to get the highest rewards. 

The machine or robot is not told which actions to take, as with other forms of 
Machine Learning, but instead the machine must discover which actions yield the 
maximum reward by trying them. 

In the most interesting and challenging cases, actions affect not only the immediate 
reward but also the next situation and all subsequent rewards. 
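
The idea that the agent is not told which action is best, and has to find out by trying, can be sketched with a very small example. Everything here is an assumption made for illustration: a hypothetical environment with two actions, "a" and "b", whose noisy rewards average 0.2 and 0.8, and an agent that mostly repeats the action with the best average so far but sometimes tries a random one.

```python
import random


class TwoDoorEnvironment:
    """Hypothetical environment: action 'b' pays more on average than action 'a'."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def step(self, action):
        mean = {"a": 0.2, "b": 0.8}[action]
        return mean + self._rng.uniform(-0.1, 0.1)  # noisy reward


def run_agent(env, n_steps=200, explore_prob=0.1, seed=1):
    """Trial-and-error loop: act, observe the reward, update the estimates."""
    rng = random.Random(seed)
    totals = {"a": 0.0, "b": 0.0}
    counts = {"a": 0, "b": 0}

    def average(action):
        return totals[action] / counts[action] if counts[action] else 0.0

    for _ in range(n_steps):
        if rng.random() < explore_prob or 0 in counts.values():
            action = rng.choice(["a", "b"])    # trial: try something
        else:
            action = max(counts, key=average)  # repeat what has worked best
        reward = env.step(action)              # the environment responds
        totals[action] += reward
        counts[action] += 1
    return {action: average(action) for action in counts}


estimates = run_agent(TwoDoorEnvironment())
print(estimates)  # the estimate for 'b' ends up higher than for 'a'
```

This toy environment has no states, so it shows only the trial-and-error part; in a full problem each action would also change the situation the agent faces next.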


RL Characteristics

We talk about characteristics next. The characteristics are generally what the agent does
to move to the next state. The agent considers which approach works best to make the
next move.

The two characteristics are

• Trial and error search. 

• Delayed reward. 

As you probably have gathered, Reinforcement Learning works on three things 
combined: 

(S,A,R) 

Where S represents state, A represents action, and R represents reward. 

If you are in a state S, you perform an action A so that you get a reward R at time 
frame t+1. Now, the most important part is when you move to the next state. In this case, 
we do not use the reward we just earned to decide where to move next. Each transition 
has a unique reward and no reward from any previous state is used to determine the next 
move. See Figure 1-12. 
Figure 1-12. State change with time (gaining reward)

The T change (the time frame) is important in terms of Reinforcement Learning.
Every occurrence of what we do is always a combination of what we perform in terms
of states, actions, and rewards. See Figure 1-13.



Figure 1-13. Another way of representing the state transition


How Reward Works 

A reward is some motivator we receive when we transition from one state to another. It 
can be points, as in a video game. The more we train, the more accurate we become, and 
the greater our reward. 
Agents

In terms of Reinforcement Learning, agents are the software programs that make
intelligent decisions. Agents should be able to perceive what is happening in the
environment. Here are the basic steps of the agents:

1. When the agent can perceive the environment, it can make 
better decisions. 

2. The decision the agents take results in an action. 

3. The action that the agents perform must be the best, the 
optimal, one. 

Software agents might be autonomous or they might work together with other agents 
or with people. Figure 1-14 shows how the agent works. 



Figure 1-14. The flow of the environment
RL Environments

The environments in the Reinforcement Learning space comprise certain factors
that determine the impact on the Reinforcement Learning agent. The agent must adapt
accordingly to the environment. These environments can be 2D worlds or grids or even a
3D world.

Here are some important features of environments: 

• Deterministic 

• Observable 

• Discrete or continuous 

• Single or multiagent. 


Deterministic 

If we can infer and predict what will happen with a certain scenario in the future, we say 
the scenario is deterministic. 

It is easier for RL problems to be deterministic because we don't rely on the
decision-making process to change state. The effect of a state transition is immediate
and predictable when we move from one state to another, which makes a Reinforcement
Learning problem easier to solve.

When we are dealing with RL, the state model we get will be either deterministic or 
non-deterministic. That means we need to understand the mechanisms behind how DFA 
and NDFA work. 


DFA (Deterministic Finite Automaton)

A DFA goes through a finite number of steps. It can only perform one action for a state. See
Figure 1-15.

The diagram shows a state transition from a start state to a final state. It is a simple
depiction in which the state transition occurs on an input value of 1 or 0. A self-loop is
created when the machine gets a value and stays in the same state.
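
A DFA can be sketched as a lookup table with exactly one next state for every (state, input) pair. The two-state machine below is a hypothetical example and is not necessarily the one drawn in Figure 1-15; the names `TRANSITIONS` and `run_dfa` are invented for this sketch.

```python
# Hypothetical DFA: exactly one next state per (state, input) pair.
TRANSITIONS = {
    ("S0", "0"): "S0",  # self-loop: input 0 keeps the machine in S0
    ("S0", "1"): "S1",
    ("S1", "0"): "S0",
    ("S1", "1"): "S1",  # self-loop: input 1 keeps the machine in S1
}


def run_dfa(inputs, start="S0"):
    """Feed the input symbols one by one and return the final state."""
    state = start
    for symbol in inputs:
        state = TRANSITIONS[(state, symbol)]
    return state


print(run_dfa("0011"))  # S1
print(run_dfa("10"))    # S0
```

Because the table holds a single entry per pair, the final state is fully determined by the start state and the input sequence.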


NDFA (Nondeterministic Finite Automaton) 

If we are working in a scenario where we don't know exactly which state a machine will
move into, this is a case of NDFA. See Figure 1-16.



The working principle of the state diagram in Figure 1-16 can be explained as
follows. In an NDFA, when we transition from one state to another, there is
more than one option available, as we can see in Figure 1-16. From state S0, after getting
an input such as 0, the machine can stay in state S0 or move to state S1. There is
decision-making involved here, so it becomes difficult to know which action to take.
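
The ambiguity can be sketched by letting each (state, input) pair map to a set of possible next states. The transition on input 0 from S0 follows the description above; the transition on input 1 is an assumption added so the example is complete, and the names `NFA_TRANSITIONS` and `possible_states` are invented for this sketch.

```python
# From S0 on input 0 the machine may stay in S0 or move to S1 (as described above).
# The entry for input 1 is an assumption made for illustration.
NFA_TRANSITIONS = {
    ("S0", "0"): {"S0", "S1"},
    ("S0", "1"): {"S0"},
}


def possible_states(inputs, start="S0"):
    """Track every state the machine could be in after reading the inputs."""
    current = {start}
    for symbol in inputs:
        nxt = set()
        for state in current:
            # A missing entry means there is no move from that state on that input.
            nxt |= NFA_TRANSITIONS.get((state, symbol), set())
        current = nxt
    return current


print(sorted(possible_states("0")))   # ['S0', 'S1']
print(sorted(possible_states("01")))  # ['S0']
```

After a single input of 0 the result contains two states, which is exactly the uncertainty that makes the non-deterministic case harder to reason about.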

Observable 

If we can say that the environment around us is fully observable, we have a perfect 
scenario for implementing Reinforcement Learning. 

An example of perfect observability is a chess game. An example of partial 
observability is a poker game, where some of the cards are unknown to any one player. 
Discrete or Continuous

When there is a limited number of choices for transitioning to the next state, that's called
a discrete scenario. When the choices range over a continuum, such as any steering angle,
that is a continuous scenario.

Single Agent and Multiagent Environments

Solutions in Reinforcement Learning can be of single agent types or multiagent types.

Let's take a look at multiagent Reinforcement Learning first. 

When we are dealing with complex problems, we use multiagent Reinforcement
Learning. Complex problems might have different environments where the agents do
different jobs and also need to interact with one another. This introduces
different complications in determining transitions in states.

Multiagent solutions are based on the non-deterministic approach.

They are non-deterministic because when the multiagents interact, there might be
more than one option to change or move to the next state and we have to make decisions
based on that ambiguity.

In multiagent solutions, the agent interactions between different environments are
enormous. They are enormous because the amount of activity involved in references to
environments is very large. This is because the environments might be different types and
the multiagents might have different tasks to do in each state transition.

The differences between single-agent and multiagent solutions are as follows:

• Single-agent scenarios involve intelligent software in which the
interaction happens in one environment only. If there is another
environment simultaneously, it cannot interact with the first
environment.

• Some problems require convergence, which here means that the
agent needs to interact far more often in different environments
to make a decision. This scenario is tackled by multiagents.
Single agents cannot tackle convergence because it connects to
other environments where there might be different scenarios
involving simultaneous decision-making.

• Multiagents have dynamic environments compared to
single agents. Dynamic environments can involve changing
environments in the places to interact with.
Figure 1-17 shows the single-agent scenario.


Figure 1-17. Single agent (environment, agent, goals, action, domain knowledge)


Figure 1-18 shows how multiagents work. There is an interaction between two agents 
in order to make the decision. 
Figure 1-18. Multiagent scenario (environment, two agents, goals, actions, domain knowledge)


Conclusion 

This chapter touched on the basics of Reinforcement Learning and covered some key
concepts. We covered states and environments and how the structure of Reinforcement
Learning looks.

We also touched on the different kinds of interactions and learned about single-agent
and multiagent solutions.

The next chapter covers algorithms and discusses the building blocks of 
Reinforcement Learning. 
CHAPTER 2


RL Theory and Algorithms 


This chapter covers how Reinforcement Learning works and explains the concepts
behind it, including the different algorithms that form the basis of Reinforcement
Learning.

The chapter explains these algorithms, but to start with, you will learn why
Reinforcement Learning can be hard and see some different scenarios. The chapter also
covers different ways that Reinforcement Learning can be implemented.

Along the way, the chapter formulates the Markov Decision Process (MDP) and 
describes it. The chapter also covers SARSA and touches on temporal differences. Then, 
the chapter touches on Q Learning and dynamic programming. 


Theoretical Basis of Reinforcement Learning 

This section touches on the theoretical basis of Reinforcement Learning. Figure 2-1 shows 
how you are going to implement MDP, which is described later. 
Figure 2-1. Theoretical basis of MDP

Environments in Reinforcement Learning are represented by the Markov Decision 
Process (discussed later in this chapter). 

• $S$ is a finite set of states. $A$ is a finite set of actions.

• $T : S \times A \times S \to [0, 1]$ is a transition model that maps
(state, action, state) triples to probabilities.

• $T(s, a, s')$ is the probability that you'll land in state $s'$
if you were in state $s$ and took action $a$.

In terms of conditional probabilities, the following is true:

$T(s, a, s') = P(s' \mid s, a)$

$R : S \times S \to \mathbb{R}$ is a reward function that gives a real number that represents
the amount of reward (or punishment) the environment will grant for a state transition.
$R(s, s')$ is the reward received after transitioning to state $s'$ from state $s$.
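
As a rough illustration of these definitions, the transition model and reward function can be written as plain dictionaries. The two states, two actions, and all probabilities and rewards below are made-up values; the helper names `transition_prob` and `expected_reward` are likewise assumptions for this sketch.

```python
# Hypothetical two-state MDP; every number here is an illustrative assumption.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "go"]

# T[(s, a)][s_next] = P(s_next | s, a)
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "go"):   {"s0": 0.7, "s1": 0.3},
}

# R[(s, s_next)] = reward for the transition s -> s_next
R = {
    ("s0", "s0"): 0.0, ("s0", "s1"): 1.0,
    ("s1", "s0"): -1.0, ("s1", "s1"): 0.0,
}


def transition_prob(s, a, s_next):
    """T(s, a, s'): probability of landing in s_next after taking a in s."""
    return T[(s, a)].get(s_next, 0.0)


def expected_reward(s, a):
    """Average of R(s, s') over the possible next states s'."""
    return sum(transition_prob(s, a, s2) * R[(s, s2)] for s2 in STATES)


# For each (state, action) pair the probabilities over next states sum to 1.
for row in T.values():
    assert abs(sum(row.values()) - 1.0) < 1e-9

print(expected_reward("s0", "go"))
```

Here `expected_reward("s0", "go")` works out to 0.2 × 0 + 0.8 × 1 = 0.8, which is the kind of quantity an agent that knows the transition model can compute directly.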

If the transition model is known to the agent, i.e., the agent knows where it would 
probably go from where it stands, it's fairly easy for the agent to know how to act in a way 
that maximizes its expected utility from its experience with the environment. 

We can define the expected utility for the agent to be the accumulated rewards it
gets throughout its experience with the environment. If the agent goes through the states
$s_0, s_1, \ldots, s_{n-1}, s_n$, you could formally define its expected utility as follows:

$\sum_{t=1}^{n} \gamma^t E[R(s_{t-1}, s_t)]$

where $\gamma$ is a discount factor used to decrease the values (and hence the importance) of
later rewards, and $E$ is the expected value.

The problem arises when the agents have no clue about the probabilistic model
behind the transitions, and this is where RL comes in. The RL problem can formally be
defined now as the problem of learning a set of parameters in order to maximize the 
expected utility. 

RL comes in two flavors: 

• Model-based: The agent attempts to sample and learn the 
probabilistic model and use it to determine the best actions it can 
take. In this flavor, the set of parameters that was vaguely referred 
to is the MDP model. 

• Model-free: The agent doesn't bother with the MDP model and
instead attempts to develop a control function that looks at
the state and decides the best action to take. In that case, the
parameters to be learned are the ones that define the control
function.
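
The model-free flavor can be sketched with tabular Q Learning, which the chapter covers later. The example below is a minimal sketch, not the book's implementation: it assumes a hypothetical corridor of five cells with the exit at cell 4, a cost of one point per move, and illustrative values for the learning rate, discount factor, and exploration probability. The learned table `q` plays the role of the control function, and the agent never builds a transition model.

```python
import random

N_CELLS, GOAL = 5, 4          # hypothetical corridor; the exit is cell 4
ACTIONS = ["left", "right"]


def step(state, action):
    """Deterministic corridor: every move costs 1; reaching GOAL ends the episode."""
    nxt = min(max(state + (1 if action == "right" else -1), 0), N_CELLS - 1)
    return nxt, -1.0, nxt == GOAL


def q_learning(episodes=500, alpha=0.5, gamma=1.0, epsilon=0.2, seed=0):
    """Learn action values from experience only (no transition model is stored)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_CELLS) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)                        # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted best next value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q


q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(greedy)
```

After training, the greedy action in every non-goal cell is expected to be "right", the direction of the exit. A model-based agent would instead estimate the transition probabilities first and then plan with them.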

Where Reinforcement Learning Is Used 

This section discusses the different fields of Reinforcement Learning, as shown in 
Figure 2-2. 
Figure 2-2. Different fields of Reinforcement Learning (manufacturing, inventory management, finance sector)

Manufacturing 

In manufacturing, factory robots use Reinforcement Learning to move an object from one
box and place it in another container.

Whether it fails or succeeds in delivering the object, the robot remembers the outcome and
learns again, with the end result of getting the best results with the greatest accuracy.


Inventory Management 

In terms of inventory management, Reinforcement Learning can be used to reduce 
transit time in stocking and can be applied to placing products in warehouses for utilizing 
space optimally. 


Delivery Management 

Reinforcement Learning is applied to solve the problem of split delivery vehicle routing. 
Q Learning is used to serve appropriate customers with one vehicle. 
Finance Sector

Reinforcement Learning is being used for accounting, using trading strategies. 


Why Is Reinforcement Learning Difficult? 

One of the toughest parts of Reinforcement Learning is having to map the environment 
and include all possible moves. For example, consider a board game. 

You have to apply artificial intelligence to what is learned. In theory, Reinforcement 
Learning should work perfectly because there are a lot of state jumps and complex moves 
in a board game. However, applying Reinforcement Learning by itself becomes difficult. 

To get the best results, we apply a rule-based engine with Reinforcement Learning.
If we don't apply a rule-based engine, there are so many options in board games that the
agent will take forever to discover the path.

First of all, we apply simple rules so that the AI learns quickly and then, as the 
complexity increases, we apply Reinforcement Learning. 
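
One way to picture the combination of rules and learning is a move chooser that checks a simple rule first and falls back to learned values only when the rule does not apply. The sketch below uses tic-tac-toe as a hypothetical board game; the board encoding (a nine-character string), the single rule ("take a winning move if one exists"), and the `learned_values` table are all assumptions made for illustration.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def legal_moves(board):
    """Indices of the empty cells on a 9-character board string."""
    return [i for i, cell in enumerate(board) if cell == " "]


def winning_move(board, player):
    """Rule: return a move that wins immediately, or None if there is none."""
    for move in legal_moves(board):
        trial = board[:move] + player + board[move + 1:]
        if any(all(trial[i] == player for i in line) for line in WIN_LINES):
            return move
    return None


def choose_move(board, player, learned_values):
    """Apply the rule first; otherwise pick the move with the best learned value."""
    move = winning_move(board, player)
    if move is not None:
        return move
    return max(legal_moves(board), key=lambda m: learned_values.get((board, m), 0.0))


print(choose_move("XX OO    ", "X", {}))  # 2 (the rule finds the winning cell)
```

The rule removes the need to learn the obvious cases by trial and error, so the learned values only have to cover the positions where no rule fires.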

Figure 2-3 shows how applying Reinforcement Learning can be difficult. 



Figure 2-3. Reinforcement Learning with rules
Preparing the Machine

Before you can run the examples, you need to perform certain steps to install the
software. The examples in this book use the Anaconda version of Python, so this section
explains how to find and download it. First, you need to open a terminal. The process of
starting the terminal is shown in Figure 2-4.


Figure 2-4. Opening the terminal


Next, you need to update the packages. Write the following command in the terminal 
to do so. See Figure 2-5. 

sudo apt-get update 
Figure 2-5. Updating the packages


After you run the update command, the package lists are refreshed, as
shown in Figure 2-6.


Figure 2-6. Everything has been updated


Now you can use another command for installing the required packages. Figure 2-7
shows the process.
sudo apt-get install golang python3-dev python-dev libcupti-dev libjpeg-turbo8-dev \
make tmux htop chromium-browser git cmake zlib1g-dev libjpeg-dev xvfb libav-tools \
xorg-dev python-opengl libboost-all-dev libsdl2-dev swig


Figure 2-7. Fetching the updates


As shown in Figure 2-8, you'll need to type y and then press Enter to continue. 


Figure 2-8. Continue with the installation
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In the next step, the essential packages are downloaded and updated accordingly, as 
shown in Figure 2-9. 
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prehi^iouslv unselected package Itbopencv-tmgproc2.4vS:and&4. 
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prevtously unselected package 1.ibpD5tprcic-ffnpegS^;andb4. 
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ltbpostproc-ffnpeg5^;and64 {7:2.8.ll-CubirntuC^. 16.04.1) ... 
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to unpack .. ./llb£iabcale-rfnpeg3_7Kia2.E.ii-e!jbontue. 16.04. i_arid64.d#b .. 

\tb£h^£cale-rfripeg3!artd64 {7t2.E-Il-OMbijntiJO.i6.04.i) ... 
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Ilhzngs:an;d64 (4.1.4-7) _ 
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to unpack _/pulseaudto-nodule-bluetootb_l1t3a8.8 -0ubuntu3.3_and64.deb .. 

pulseaudto-nodule-bluetoQtb (l:8.0-0ubuntu3.3) over {l:8.8-0ubuntu3) _ 

to unpack .../pulseaudto-nodule-Kll^lXBaB.O-Oubuntul.3_andG4.deb ... 
pulseaudto-nodule-Kii (i:8.o-0(jbuntu3.3} over £its.o-eubuntu3) ... 
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Itb5dll.2debian;afid64 '£l>2.13+dfsgl-3) ... 

prevlously unselected package Itbavdevtce-ffnpeg56:aind64. 

to unpack .. ./l.tbavdevtce-ffnpegS6_71f3a2.S.il'0ubuntu6.l6.04.1_and64.deb 

ltbavdevtte-ffnpegB6:and64 {7:Z.e.ii-eubuntue.16.04.1) ... 

prevlously unselected package ltbvdpaui:and64. 

to unpack .. ./\tbvdpaul_l.l.l‘3ubuntul_andd4.dleb ... 

I.tbvdpaul:and64 Cl.l.l-3ubuntul) ... 
prevtously unselected package ffnpeg. 

to unpack .../ffnpeg_7ls3a2.S.ll-8ubuntuO.I6.04.1_and64.deb ... 
ffnpeg (7:2.B.Il'9ubuntu9.l6.04.l) ... 


[3 l3)n™ >: r^ L 


Figure 2-9, Downloading and extracting the packages 


You now need to install the Anaconda distribution of Python. Open a browser window
in Ubuntu (this example shows Mozilla Firefox) and search for the
Anaconda installation, as shown in Figure 2-10.

Figure 2-10. Downloading Anaconda

Now you have to find the download that's appropriate for your particular operating
system. The Anaconda page is shown in Figure 2-11.

Figure 2-11. Anaconda page

Select the appropriate distribution of Anaconda, as shown in Figure 2-12.

Figure 2-12. Selecting the Anaconda version

Save the file next, as shown in Figure 2-13.


Figure 2-13. Saving the file



CHAPTER 2 RL THEORY AND ALGORITHMS 


Now, using the terminal, change into the Downloads folder and check that the file
was saved there. Note that folder names are case sensitive. See Figure 2-14.

bash: cd: downloads: No such file or directory
openai@ubuntu:~$ cd Downloads
openai@ubuntu:~/Downloads$ dir
Anaconda3-4.4.0-Linux-x86_64.sh
openai@ubuntu:~/Downloads$

Figure 2-14. Getting inside the downloads folder

You now have to use the bash command to run the shell script (see Figure 2-15):

bash Anaconda3-4.4.0-Linux-x86_64.sh

Figure 2-15. Running the shell script

When the installer prompts you, type yes and press Enter. Anaconda will be installed
into the home location, as shown in Figure 2-16.

Please answer 'yes' or 'no':
>>> yes

Anaconda3 will now be installed into this location:
/home/openai/anaconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/openai/anaconda3] >>>

Figure 2-16. Setting up the Anaconda environment

The next step, shown in Figure 2-17, installs all the important packages for
Anaconda so that it is configured properly.

Figure 2-17. Installing the key packages for Anaconda

After the Anaconda installation is complete, open a new terminal to set up your
Anaconda environment. Create a new environment for Anaconda using the conda create
command (see Figure 2-18).

Figure 2-18. Creating an environment


The following command creates an environment named universe, which keeps all of its
packages in an isolated place:

conda create --name universe python=3.6 anaconda

In the next step, the Anaconda environment will install the necessary packages. See
Figure 2-19.

Proceed ([y]/n)?

Figure 2-19. The packages for installing or updating Anaconda

Type y and then press Enter to continue. The process is complete once every package
in the environment has been updated. You can now activate the environment. See
Figure 2-20.

To activate this environment, use:
> source activate universe

To deactivate this environment, use:
> source deactivate universe

Figure 2-20. The packages for installing or updating Anaconda
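To confirm that the environment is active, one option is to ask Python which interpreter is running. The following is a minimal sketch and not part of the book's setup steps: the helper name describe_python_environment is hypothetical, and the sketch assumes that conda exports the CONDA_DEFAULT_ENV variable when an environment is activated.

```python
import os
import sys


def describe_python_environment():
    """Return basic facts about the interpreter that is running this script."""
    return {
        "executable": sys.executable,  # full path of the python binary in use
        "prefix": sys.prefix,          # root folder of the active installation
        "version": "{}.{}".format(*sys.version_info[:2]),
        # Assumption: conda sets this variable when an environment is activated.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_python_environment().items():
        print("{:>10}: {}".format(key, value))
```

If the universe environment is active, the prefix should point inside the anaconda3/envs/universe folder and conda_env should read universe.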


Some additional updates might need to be installed. You also need to install Swig, as 
shown in Figure 2-21. 


conda install pip six libgcc swig

Figure 2-21. Installing Swig too

You will also have to install OpenCV in order to update certain packages, as shown in
Figure 2-22.

(universe) openai@ubuntu:~$ conda install opencv

Figure 2-22. Installing OpenCV


If there are updates to OpenCV, type y to install them too. See Figure 2-23. 


Proceed ([y]/n)? y

Figure 2-23. Installing OpenCV


Next, you need to install TensorFlow. This chapter shows how to install the CPU
version. See Figure 2-24.

pip install --upgrade tensorflow

Figure 2-24. Installing TensorFlow

Figure 2-25 shows the packages being installed for TensorFlow. 


Figure 2-25. TensorFlow installs the packages
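Once the installs finish, it can help to check that the packages are importable from the active environment. The sketch below is an illustration rather than a step from the book: missing_packages is a hypothetical helper, and the list of names is an assumption based on the packages installed so far (OpenCV is imported as cv2).

```python
import importlib.util


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names can differ from install names (opencv is imported as cv2).
    wanted = ["numpy", "pandas", "six", "cv2", "tensorflow"]
    missing = missing_packages(wanted)
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only looks the modules up and does not import them, so it runs quickly even for a large package such as TensorFlow.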


The next step, shown in Figure 2-26, asks for the privileges to install the other 
packages. Type y to continue. 


Figure 2-26. Package installation happens


In the next section, we install Docker. We will first learn what Docker is. 


Installing Docker 

When you want to keep your containers in the cloud, Docker is one of the best options.
Developers generally use Docker to minimize workloads on a single machine, because
the entire architecture can be hosted in the developer environment. Enterprises use
Docker to maintain an agile environment. Operators generally use Docker to keep an eye
on apps and to run and manage them effectively.

Now you will install Docker, as it is essential for OpenAI Gym and Universe to work.
You need Docker because it runs with low resource overhead, which keeps the
simulations responsive while you are training in an environment.

The command to be entered in the terminal is shown here:

$ sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

The next command to enter is:

$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

You use curl and the HTTPS link to download Docker's GPG key, which apt-key then adds
to the list of trusted keys. Now add the Docker repository using this command:

$ sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"

Type this command to update the package index, as shown in Figure 2-27:

$ sudo apt-get update



Figure 2-27. Updating the package


Type this command to install Docker, as shown in Figure 2-28:

$ sudo apt-get install docker-ce

Figure 2-28. Docker installation


To test Docker, use these commands (see Figure 2-29):

$ sudo service docker start
$ sudo docker run hello-world

Hello from Docker!
This message shows that your installation appears to be working correctly.

Figure 2-29. Testing Docker
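The same check can be scripted from Python, which is convenient before launching a training run. This is a hedged sketch: docker_version is a hypothetical helper, it only calls the docker command-line client with its --version flag, and it reports None instead of raising an error when the client is not installed.

```python
import shutil
import subprocess


def docker_version():
    """Return the output of `docker --version`, or None if the CLI is unavailable."""
    if shutil.which("docker") is None:
        return None  # docker is not on the PATH
    try:
        result = subprocess.run(
            ["docker", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


if __name__ == "__main__":
    version = docker_version()
    print(version if version else "Docker CLI not found")
```

Note that this confirms only that the client is installed. Running the hello-world image, as shown above, remains the test that the Docker daemon itself is working.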




An Example of Reinforcement Learning with Python

This section goes through an example of Reinforcement Learning and explains the flow of
the algorithm. You'll see how Reinforcement Learning can be applied. This section uses
an open source GitHub repo that has a very good example of Reinforcement Learning.
You will need to clone it to work with it.

The GitHub repo link is https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow.
In Ubuntu, open the terminal and start cloning the repo, as shown in Figure 2-30.



Figure 2-30. Cloning the repo

Figure 2-31 shows how the repo is replicated.

abhi@ubuntu:~$ source activate universe
(universe) abhi@ubuntu:~$ git clone https://github.com/MorvanZhou/Reinforcement-learning-with-tensorflow.git
Cloning into 'Reinforcement-learning-with-tensorflow'...

Figure 2-31. Replication of the repo


Next, change into the folder of the cloned repo, as shown in Figure 2-32.

(universe) abhi@ubuntu:~$ cd Reinforcement-learning-with-tensorflow
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow$ dir
contents  experiments  LICENCE  README.md  RL_cover.jpg
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow$ cd contents
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents$ dir
10_A3C
11_Dyna_Q
12_Proximal_Policy_Optimization
1_command_line_reinforcement_learning
2_Q_Learning_maze
3_Sarsa_maze
4_Sarsa_lambda_maze
5.1_Double_DQN
5.2_Prioritized_Replay_DQN
5.3_Dueling_DQN
5_Deep_Q_Network
6_OpenAI_gym
7_Policy_gradient_softmax
8_Actor_Critic_Advantage
9_Deep_Deterministic_Policy_Gradient_DDPG

Figure 2-32. Getting inside the folder

We are working with a Reinforcement Learning scenario in which the letter o represents
a wanderer. The wanderer wants to reach the treasure T as fast as it can.
The condition looks like this:

o----T

The wanderer tries to find the quickest path to reach the treasure. During each
episode, the steps the wanderer takes to reach the treasure are counted. With each
episode, the condition improves and the number of steps tends to decline.

Here are some of the basic steps in terms of Reinforcement Learning:

• The program works with actions, as actions are very
important in terms of Reinforcement Learning.

• The available actions for this wanderer are moving left or right:

ACTIONS = ['left', 'right']

• The wanderer can be considered the agent.

• The number of states (the length of the one-dimensional world) is limited
to 6 in this example:

N_STATES = 6

Now you need to set the hyperparameters for Reinforcement Learning.


What Are Hyperparameters? 

Hyperparameters are variables that are set before training begins, rather than learned
from experience. Generally, they are different from the parameters of the model for the
underlying system under analysis, which are learned during training.

We introduce epsilon, alpha, and gamma.

• Epsilon is the greedy factor (0.9 in this example)

• Alpha is the learning rate (0.1)

• Gamma is the discount factor (0.9)

The maximum number of episodes in this case is 13. The refresh time (0.3 seconds) is
the pause after each move before the scenario is redrawn.
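To make the discount factor concrete, the sketch below computes what a reward of 1 is worth when it arrives a few moves in the future. It is an illustration under the chapter's settings (gamma of 0.9, six states, a reward of 1 on the move that reaches the treasure); discounted_reward is a hypothetical helper name. For this small deterministic world, these are the values that the Q table entries for moving right approach with enough training.

```python
GAMMA = 0.9    # discount factor used in this chapter
N_STATES = 6   # positions 0..5, with the treasure at position 5


def discounted_reward(moves_needed, reward=1.0, gamma=GAMMA):
    """Value of a reward that arrives on the last of `moves_needed` moves."""
    return reward * gamma ** (moves_needed - 1)


# From state s the wanderer needs (N_STATES - 1 - s) moves to the right.
values = [round(discounted_reward(N_STATES - 1 - s), 4) for s in range(N_STATES - 1)]
print(values)  # [0.6561, 0.729, 0.81, 0.9, 1.0]
```

The closer the wanderer is to the treasure, the larger the value, which is what steers a greedy agent to the right.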


Writing the Code 

To create the process from which the computer learns, we have to formulate a table. This
process is known as Q Learning and the table is called a Q table. (You will learn more
about Q Learning in the next chapter.) The Q table holds one value for every
state-action pair, and all the decisions are made based on it.

def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),  # q_table initial values
        columns=actions,                     # action names
    )
    # print(table)  # show table
    return table

Now we have to take actions. To do so, we use this code:

def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    # act non-greedy, or the actions in this state have no value yet
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):
        action_name = np.random.choice(ACTIONS)
    else:  # act greedy
        # idxmax returns the action name (column label) with the highest value
        action_name = state_actions.idxmax()
    return action_name
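The selection rule in choose_action can be tried out without pandas. The following is a dependency-free sketch of the same idea, not the repo's code: epsilon_greedy is a hypothetical name, and a plain dictionary stands in for one row of the Q table. With epsilon at 0.9, the greedy action is taken about 90 percent of the time and a random action otherwise.

```python
import random


def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action name from q_row, a dict of action name -> estimated value."""
    unexplored = all(value == 0 for value in q_row.values())
    if rng.random() > epsilon or unexplored:
        return rng.choice(sorted(q_row))  # explore: uniform random action
    return max(q_row, key=q_row.get)      # exploit: highest estimated value


rng = random.Random(2)  # seeded so that repeated runs behave the same way
row = {"left": 0.0, "right": 0.5}
picks = [epsilon_greedy(row, 0.9, rng) for _ in range(1000)]
print("share of 'right':", picks.count("right") / len(picks))
```

The share printed is expected to be somewhat above 0.9, because the random branch also picks right half of the time.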

Now we create the environment and determine how the agent will work within the
environment:

def get_env_feedback(S, A):
    # This is how the agent will interact with the environment
    if A == 'right':  # move right
        if S == N_STATES - 2:  # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:  # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R
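A short trace shows what this function returns as the wanderer moves. The sketch below uses a compact copy of the transition logic so that it runs on its own; the walk helper is hypothetical and exists only for this illustration.

```python
N_STATES = 6  # same world length as in this chapter


def get_env_feedback(S, A):
    """Compact copy of the chapter's transition function."""
    if A == 'right':
        if S == N_STATES - 2:
            return 'terminal', 1  # the treasure is reached
        return S + 1, 0
    return max(S - 1, 0), 0       # moving left stops at the wall


def walk(actions, start=0):
    """Apply a list of actions and record (next_state, reward) pairs."""
    S, trace = start, []
    for A in actions:
        S, R = get_env_feedback(S, A)
        trace.append((S, R))
        if S == 'terminal':
            break
    return trace


print(walk(['right'] * 5))
# [(1, 0), (2, 0), (3, 0), (4, 0), ('terminal', 1)]
print(walk(['left', 'right']))
# [(0, 0), (1, 0)]
```

The first trace also shows that five moves is the fewest the wanderer can take to reach the treasure from state 0.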

This function prints the current positions of the wanderer and the treasure:

def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-']*(N_STATES-1) + ['T']  # '-----T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)
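To see what update_env draws without the delays and carriage returns, the drawing step can be isolated into a function that returns the string. This is a small sketch with a hypothetical helper named render; it covers only the non-terminal states, where S is an integer position.

```python
N_STATES = 6  # same world length as in this chapter


def render(S):
    """Return the line drawn for a non-terminal state S (an integer position)."""
    env_list = ['-'] * (N_STATES - 1) + ['T']
    env_list[S] = 'o'  # mark the wanderer's position
    return ''.join(env_list)


for S in [0, 2, 4]:
    print(render(S))
# o----T
# --o--T
# ----oT
```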

The rl() function runs the Q Learning loop, which we discuss in the next chapter:

def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.loc[S, A]  # .loc replaces .ix, which newer pandas removed
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()  # next state is not terminal
            else:
                q_target = R  # next state is terminal
                is_terminated = True  # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state

            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table

if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)
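The update line in the loop is easier to follow with numbers. The sketch below applies the same rule to two single updates, using the chapter's alpha of 0.1 and gamma of 0.9; q_update is a hypothetical helper that isolates the arithmetic and is not a function from the repo.

```python
ALPHA = 0.1  # learning rate
GAMMA = 0.9  # discount factor


def q_update(q_old, reward, max_q_next=0.0, terminal=False):
    """One Q Learning update for a single state-action pair."""
    q_target = reward if terminal else reward + GAMMA * max_q_next
    return q_old + ALPHA * (q_target - q_old)


# The first time the treasure is reached from state 4, the target is the reward, 1.
q4_right = q_update(0.0, reward=1, terminal=True)
# Later, moving right from state 3 into state 4, the target is GAMMA times the
# best value currently stored for state 4.
q3_right = q_update(0.0, reward=0, max_q_next=q4_right)
print(round(q4_right, 4), round(q3_right, 4))  # 0.1 0.009
```

Each update moves the stored value only a tenth of the way toward its target, which is why the value of the treasure spreads backward through the states over several episodes.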
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The full code looks like this: 

import numpy as np 
import pandas as pd 
import time 

np.random.seed(2) # reproducible 

N_STATES =6 # the length of the 1 dimensional world 

ACTIONS = ['left', 'right'] # available actions 

EPSILON =0.9 # greedy police 

ALPHA =0.1 # learning rate 

GAMMA =0.9 # discount factor 

MAX_EPIS0DES =13 # maximum episodes 

FRESH_TIME =0.3 # fresh time for one move 

def build_q_table(n_states, actions): 
table = pd.DataFrame( 

np.zeros((n_states, len(actions))), # q_table initial values 
columns=actions, # actions's name 

) 

# print(table) # show table 
return table 

def choose_action(state, q_table): 

# This is how to choose an action 
state_actions = q_table.iloc[state, :] 

if (np.random.uniformO > EPSILON) or (state_actions.all() == O): # act 

non-greedy or state-action have no value 

action_name = np.random.choice(ACTIONS) 
else: # act greedy 

action_name = state_actions.argmax() 
return action_name 

def get_env_feedback(S, A): 

# This is how agent will interact with the environment 

if A == 'right': # move right 

if s == N_STATES - 2: # terminate 

S_ = 'terminal' 

R = 1 
else: 

S_ = S + 1 
R = 0 

else: # move left 

R = 0 
if S == 0: 

S_ = S # reach the wall 
else: 
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S_ = S - 1 
return S_, R 

def update_env(S, episode, step_counter):
    # This is how the environment is updated
    env_list = ['-'] * (N_STATES - 1) + ['T']  # '-----T' is our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode + 1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)

def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()  # next state is not terminal
            else:
                q_target = R  # next state is terminal
                is_terminated = True  # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state

            update_env(S, episode, step_counter + 1)
            step_counter += 1
    return q_table

if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)
Let's now run the program and analyze the output. You need to get inside the cloned GitHub repo and into the required folder, as shown in Figure 2-33.

(universe) abhi@ubuntu:~$ cd Reinforcement-learning-with-tensorflow
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow$ cd contents
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents$ dir

Figure 2-33. Getting inside the cloned repo


Now you need to get inside the directory to run the program, as shown in Figure 2-34. 


(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents/1_command_line_reinforcement_learning$ dir
treasure_on_right.py

Figure 2-34. Checking the directory


Now you have to run the program called treasure_on_right.py, which places the treasure to the right of the agent. See Figure 2-35.

(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents/1_command_line_reinforcement_learning$ python treasure_on_right.py

Figure 2-35. Running the Python file


The program is running iterations, as shown in Figure 2-36. 


(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents/1_command_line_reinforcement_learning$ python treasure_on_right.py
o----T

Figure 2-36. As the iteration happens


When the program and the simulation complete, the final result is a Q table. Each row is a state (a position in the one-dimensional world), and the left and right columns hold the learned value of moving in that direction from that state. Figure 2-37 shows the completed Q table.



Q-table:

       left     right
0  0.000001  0.005728
1  0.000271  0.032612
2  0.002454  0.111724
3  0.000073  0.343331
4  0.000810  0.745813
5  0.000000  0.000000

Figure 2-37. The Q table created as a result
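To make the table easier to read, the following sketch extracts the greedy action for each state. It assumes the values printed in Figure 2-37 and uses the pandas `idxmax` method; the variable names are illustrative and are not part of treasure_on_right.py.

```python
import pandas as pd

# Q-values copied from the table in Figure 2-37
q_table = pd.DataFrame(
    {
        "left":  [0.000001, 0.000271, 0.002454, 0.000073, 0.000810, 0.000000],
        "right": [0.005728, 0.032612, 0.111724, 0.343331, 0.745813, 0.000000],
    }
)

# For each state, pick the action (column label) with the largest Q-value
greedy_policy = q_table.idxmax(axis=1)

# State 5 is the treasure itself, so its row was never updated; drop it
greedy_policy = greedy_policy.iloc[:-1]
print(greedy_policy.tolist())
```

In every non-terminal state the larger value is in the right column, which matches the treasure being placed to the right of the agent.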


What Is MDP? 

MDP (Markov Decision Process) is a mathematical framework for modeling decision making in situations where the outcome is partly random and partly in the hands of a decision maker.

MDPs have many different applications, as shown in Figure 2-38.



Figure 2-38, MDP and its applications 

Every state in MDP satisfies the Markov property. 

The Markov Property 

In the world of Reinforcement Learning, the Markov property refers to the memoryless property of a stochastic process. A stochastic process is a mathematical object consisting of a sequence of random variables. The process is memoryless when the next state depends only on the current state, so we do not need to store the history of earlier values. See Figure 2-39.



Figure 2-39, The Markov property process 

We talk about the Markov Chain in the next section. 
The Markov Chain

If a Markov process has either a discrete state space or a discrete index set (often representing time), it is known as a Markov Chain. The term is used in two ways, as shown in Figure 2-40.


[Figure: a Markov process is called a Markov Chain either when it runs in discrete or continuous time with a countable state space, or when it runs in discrete time with a countable or continuous state space.]

Figure 2-40. Markov Chain

Let's look at Markov Chains using an example. This example compares sales of Rin detergent versus the other detergents in the market. Assume that sales of Rin are 20 percent of the total detergent sales, which means the rest comprise 80 percent. People who use Rin detergent are defined as A; the others are A'.

Now we define a rule. Of the people who use Rin detergent, 90% of them continue to 
use it after a week whereas 10% shift to another brand. 

Similarly, 70% of the people who use another detergent shift to Rin after a week, and 
the rest continue to use the other detergent. 
To analyze these conditions, we need a state diagram. See Figure 2-41.

Figure 2-41. Rin detergent state diagram


In the state diagram, the circular points represent states. From this state diagram, we derive a transition probability matrix.

The transition probability matrix we get from the state diagram is shown in Figure 2-42. Each row is the current state and each column is the state one week later.

$$
P = \begin{array}{c|cc} & A & A' \\ \hline A & .9 & .1 \\ A' & .7 & .3 \end{array}
$$

Figure 2-42. The transition probability matrix
To determine the use of Rin after two weeks, we apply one principle at every step: multiply the current market shares by the transition probabilities. The same principle applies to any process of this kind.

It can be shown as a line connection, as shown in Figure 2-43.




Figure 2-43, A connected graph 


From the origin, we have two paths—one for Rin detergent (through A) and the other 
for the rest (that is A'). Here is how the path is created. 

1. From the origin, we create a path for A, so we have to focus on 
the transition probability matrix. 

2. We trace the path of A. 

3. From the starting market share, the detergent Rin has a market share of 20%.

4. From the starting point A, we focus on the transition probability matrix.

There is a 90% probability of staying on A, so the other 10% change to the alternate path (to A').

Figure 2-44 shows this path calculation graphically.




Figure 2-44, Path calculation 


The total path probability is determined as so: P = .2 *.9 + .8*.7 = .18 + .56 = .74. 
This is the percentage of people using Rin after one week. 

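This formula can also be expressed with the current market share ($S_0$) and the transition probability matrix ($P$):

$$
S_0 P = \text{market share after one week}
$$

See Figure 2-45.

$$
S_0 P = \begin{bmatrix} .2 & .8 \end{bmatrix} \begin{bmatrix} .9 & .1 \\ .7 & .3 \end{bmatrix}
$$

Figure 2-45. The matrix created for the next week

The calculation is:

$$
\begin{aligned}
.2 \times .9 + .8 \times .7 &= .74 \\
.2 \times .1 + .8 \times .3 &= .26 \\
S_1 &= \begin{bmatrix} .74 & .26 \end{bmatrix}
\end{aligned}
$$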

This gives the first state matrix, $S_1$. After one week, the sale of Rin detergent is 74% of the market. The other brands then make up 26% of the market.

Now try to find the percentage of people using Rin detergent after two weeks. Figure 2-46 shows the calculation that we need to do after two weeks.

$$
S_1 P = \begin{bmatrix} .74 & .26 \end{bmatrix} \begin{bmatrix} .9 & .1 \\ .7 & .3 \end{bmatrix}
$$

Figure 2-46. The next transition matrix

So the result, with columns A and A', is:

$$
S_2 = \begin{bmatrix} .848 & .152 \end{bmatrix}
$$

After two weeks, 84.8% of the people will use Rin and 15.2% will use other detergents.

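One question you might have is whether the sale of Rin will ever reach 100% of the market. It will not. As we go along, the state matrix becomes stationary after a certain number of iterations, which means that multiplying it by $P$ no longer changes it. Solving $S = SP$ for this transition matrix (with columns A and A'), the market share finally settles at:

$$
S = \begin{bmatrix} .875 & .125 \end{bmatrix}
$$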
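The week-by-week calculation can be checked numerically. The sketch below assumes NumPy and uses the starting share and transition matrix from the Rin example; the function name `share_after` is illustrative. The long-run share is computed by repeated multiplication rather than assumed.

```python
import numpy as np

# Transition matrix from the Rin example: rows are the current state,
# columns are the state one week later (order: A = Rin, A' = other brands)
P = np.array([[0.9, 0.1],
              [0.7, 0.3]])

# Starting market share S0: 20% Rin, 80% other brands
S0 = np.array([0.2, 0.8])

def share_after(weeks, start=S0, transition=P):
    """Return the market share vector after the given number of weeks."""
    share = start.copy()
    for _ in range(weeks):
        share = share @ transition
    return share

S1 = share_after(1)       # market share after one week
S2 = share_after(2)       # market share after two weeks
S_long = share_after(50)  # after many weeks the vector stops changing

print(S1, S2, S_long)
```

A vector is stationary when multiplying it by P returns the same vector, which can be verified with `np.allclose(S_long, S_long @ P)`.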

After going through the basics of the Markov state and the Markov Chain, it's time to focus on MDPs again.


MDPs 

Almost all Reinforcement Learning problems can be formalized as MDPs. An MDP provides the formal setting in which Reinforcement Learning is applied. In essence, an MDP is a Markov process extended with actions and rewards.
A state ($S_t$) is Markov if and only if it meets the criteria shown in Figure 2-47.

$$
P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]
$$

Figure 2-47. The Markov state property


The state captures all relevant information from the history. We do not have to retain everything in history because only the current state determines what will happen next.

For a Markov state (s) and successor state (s'), the state transition probability is defined in Figure 2-48.

$$
P_{ss'} = P[S_{t+1} = s' \mid S_t = s]
$$

Figure 2-48. The transition probability


An MDP is a Markov reward process with decisions added to it. It is a type of environment where all the states are Markov.

An MDP is a five-tuple <S, A, P, R, γ>:

• S is a finite set of states

• A is a finite set of actions

• P is the state transition probability matrix

• R is the reward function

• γ (gamma) is the discount factor

A policy (π) is a distribution over actions in a given state. A policy is a function or a decision-making process that determines which action the agent takes in each state, and therefore how it moves from one state to another.
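As a rough illustration of the tuple and of a policy as a distribution over actions, the sketch below builds a small MDP. Everything in it is hypothetical: the two states, the two actions, the probabilities, the rewards, and the `MDP` class itself are made up for this example and do not come from the repo.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    states: list     # S
    actions: list    # A
    P: np.ndarray    # P[a, s, s'] = probability of moving s -> s' under action a
    R: np.ndarray    # R[s, a] = expected immediate reward
    gamma: float     # discount factor

# Hypothetical two-state, two-action MDP used only for illustration
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "move"],
    P=np.array([
        [[1.0, 0.0], [0.0, 1.0]],   # "stay" keeps the current state
        [[0.2, 0.8], [0.8, 0.2]],   # "move" switches state with probability 0.8
    ]),
    R=np.array([[0.0, 0.0], [1.0, 0.0]]),
    gamma=0.9,
)

# A stochastic policy: policy[s, a] = probability of choosing action a in state s
policy = np.array([[0.1, 0.9],
                   [0.9, 0.1]])

# State-to-state transition matrix induced by following the policy:
# P_pi[s, s'] = sum over a of policy[s, a] * P[a, s, s']
P_pi = np.einsum("sa,ast->st", policy, toy.P)
print(P_pi)
```

Once a policy is fixed, the MDP reduces to a Markov Chain with transition matrix `P_pi`, which connects this section to the Rin example.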

SARSA 

SARSA stands for State, Action, Reward, next State, and next Action. It is a Reinforcement Learning approach derived from temporal difference learning. We'll discuss temporal difference learning first.

Temporal Difference Learning 

Temporal difference learning updates the estimated value of a state from its own vicinity: the reward just received and the current estimate for the state that follows. We generally apply temporal difference learning when we are in a state and want to know what is happening in successive states.

The general idea is that we want to predict the best path over a period of time.

We go from state S0 to state SF. We get rewards in each state. We will be trying to predict the discounted sum of rewards. See Figure 2-49.

S0 --r0--> S1 --r1--> S2 --r2--> SF

Figure 2-49. State transition


We start by looking at the Markov Chain, as shown in Figure 2-50. 



[Figure: a Markov Chain with stochastic transitions between states]

Figure 2-50. The Markov Chain


The equation states that the value function maps each state to some number. This number is set to 0 if it is the final state (see Figure 2-51).

$$
V(s) = \begin{cases} 0, & \text{if } s = s_F \\ E[r + \gamma V(s')], & \text{otherwise} \end{cases}
$$

Figure 2-51. The value function

For any state, the value is the expected value of the reward (r) plus the discounted value of the next state (s').
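The value function can be estimated with a temporal difference update. The following is a minimal sketch, not the book's code: it assumes a deterministic chain S0, S1, S2, SF with made-up rewards, so the expectation reduces to a single outcome, and the values of `GAMMA` and `ALPHA` are assumptions.

```python
GAMMA = 0.9   # discount factor (assumed value)
ALPHA = 0.5   # TD learning rate (assumed value)

# Deterministic chain S0 -> S1 -> S2 -> SF with hypothetical rewards
transitions = {"S0": ("S1", 1.0), "S1": ("S2", 0.0), "S2": ("SF", 2.0)}

V = {"S0": 0.0, "S1": 0.0, "S2": 0.0, "SF": 0.0}  # V(SF) stays 0

for _ in range(200):                 # repeat episodes
    s = "S0"
    while s != "SF":
        s_next, r = transitions[s]
        # TD(0) update: move V(s) toward r + gamma * V(s')
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])
        s = s_next

print(V)
```

The estimates approach the values given by the equation: V(S2) = 2, V(S1) = 0.9 × 2 = 1.8, and V(S0) = 1 + 0.9 × 1.8 = 2.62.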


How SARSA Works 

Now we get into SARSA. SARSA is known as an on-policy Reinforcement Learning method. On-policy means that the agent learns only from its own experiences, that is, from the actions chosen by the policy it is currently following.

It accumulates updates over one or more steps and learns from its experiences.

From the current state, we choose an action and then get to the next state. At the next state, we choose another action and use the current state and the current action with the next state and next action. We then use all of these values together to update the Q value.

Here is the algorithm: 

1. Initialize Q(s, a) arbitrarily.

2. Initialize s. Steps 2 through 6 are repeated for each episode.

3. Choose a from s using the policy derived from Q (for example, ε-greedy).

4. Take action a and observe r and s'.

5. Choose a' from s' using the policy derived from Q (for example, ε-greedy), then update:

$$
Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]
$$

$$
s \leftarrow s'; \quad a \leftarrow a'
$$

6. Repeat steps 4 and 5 for each step of the episode until s is terminal.
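The steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the repo's implementation: it reuses the idea of a one-dimensional world with the treasure at the right end, and here `EPSILON` is the probability of exploring, which is the opposite convention from the `EPSILON = 0.9` used earlier in this chapter.

```python
import numpy as np

N_STATES = 6            # states 0..5; state 5 is the terminal treasure
ACTIONS = [0, 1]        # 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # EPSILON is the exploration probability

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, len(ACTIONS)))   # step 1: initialize Q(s, a)

def choose_action(s):
    """Epsilon-greedy action from Q, breaking ties at random."""
    if rng.random() < EPSILON:
        return int(rng.integers(len(ACTIONS)))
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))

def step(s, a):
    """Move left or right; reaching the last state gives reward 1."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

for episode in range(200):
    s = 0                                  # step 2: initialize s
    a = choose_action(s)                   # step 3: choose a from s
    while s != N_STATES - 1:               # step 6: until s is terminal
        s_next, r = step(s, a)             # step 4: take a, observe r and s'
        a_next = choose_action(s_next)     # step 5: choose a' from s'
        # The terminal row of Q is never updated, so Q(terminal, a') is 0
        Q[s, a] += ALPHA * (r + GAMMA * Q[s_next, a_next] - Q[s, a])
        s, a = s_next, a_next

print(Q.round(3))
```

Note that the update uses the action `a_next` that the agent will actually take, which is what makes the method on-policy.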

Q Learning 

Q Learning is a model-free Reinforcement Learning technique. Figure 2-52 illustrates the 
general procedure for Q Learning. 
[Figure: Q Learning is used to find an optimal action policy for a finite MDP.]

Figure 2-52. The Q Learning process

What Is Q? 

Q can be stated as a function that takes two parameters, s and a. Q can also be stored as a table, with one row per state and one column per action.

Q represents the value of taking action a in state s.

Q[s, a] = Immediate reward + discounted reward

The immediate reward is the point given when the agent moves from one state to another while doing an action.

The discounted reward is the value of the rewards expected in the future, scaled down by the discount factor.
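A small arithmetic sketch may help here. It assumes a made-up sequence of rewards that follows taking action a in state s, and a discount factor of 0.9; the function name is illustrative.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical reward sequence following (s, a): nothing for two steps, then 1
rewards = [0.0, 0.0, 1.0]
gamma = 0.9

immediate = rewards[0]
discounted_future = discounted_return(rewards, gamma) - immediate
q_value = immediate + discounted_future
print(q_value)
```

Here the immediate reward is 0 and the only future reward arrives two steps later, so it is weighted by 0.9 squared.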


How to Use Q 

We use the Q table whenever we need to decide what to do in a given situation.

We are looking at what action to take or which policy to implement when we are in state s. We use the Q table to get the best result.

If we are in state s, we need to determine which action is the best. We do not change s, but we go through all the values of a and determine which one gives the largest Q value. That will be the action we should take. Mathematically, this idea is represented as shown in Figures 2-53 and 2-54.


$$
\pi(s) = \operatorname{argmax}_a Q[s, a]
$$

Figure 2-53. The policy equation

$$
\pi(a \mid s) = P[A_t = a \mid S_t = s]
$$

[Figure: the policy is a stochastic matrix that decides where to go. When we have a policy, we can say how the agent will behave.]

Figure 2-54. How policy works


For MDP, the policy we should implement depends on the current state. We 
maximize the rewards to get the optimal solution. 
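Both forms of the policy can be read off a Q table. The sketch below assumes a hypothetical table with three states and two actions; the ε-greedy construction is one common way to build a stochastic policy and is an assumption here, not something the figures prescribe.

```python
import numpy as np

# Hypothetical Q table: 3 states (rows) x 2 actions (columns)
Q = np.array([[0.1, 0.5],
              [0.7, 0.2],
              [0.0, 0.3]])

# Deterministic policy: pi(s) = argmax_a Q[s, a]
greedy = Q.argmax(axis=1)

def epsilon_greedy_matrix(Q, epsilon):
    """Stochastic policy pi(a|s): explore uniformly with probability epsilon."""
    n_states, n_actions = Q.shape
    pi = np.full((n_states, n_actions), epsilon / n_actions)
    pi[np.arange(n_states), Q.argmax(axis=1)] += 1.0 - epsilon
    return pi

pi = epsilon_greedy_matrix(Q, epsilon=0.1)
print(greedy)   # one action index per state
print(pi)       # each row is a probability distribution over actions
```

Each row of `pi` sums to 1, which is what makes it a stochastic matrix.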


SARSA Implementation in Python 

Recall that SARSA is an on-policy Reinforcement Learning approach.

For example, SARSA can be used to solve a maze. Because SARSA is on-policy, the agent learns only from its own experience in the maze it is working on. We cannot compare two different maze environments or use experience from an outside maze. We have to stick to one maze, and we'll use the previous one as an example.

The best thing about SARSA is that it learns by comparing the current state with the next state or with subsequent states. We accumulate all the experiences and learn from them.

Let's break this idea down more. The update is done on a Q table by comparing the changes in subsequent steps and then making a decision. This idea is illustrated in Figure 2-55.


[Figure: the Q table update looks up the current state and current action, then the next state and next action, combines all of the values, and updates the Q table with the result.]

Figure 2-55. Updating results using the SARSA table
The learning method in Python is different for SARSA. It looks like this:

def learn(self, s, a, r, s_, a_):

This method depends on the state, the action, the reward, the next state, and the next action.

If we compare the algorithm and convert it to Python, the construct for this equation is shown in Figure 2-56.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]
$$

Figure 2-56. The SARSA equation

It's converted to the following:

q_target = r + self.gamma * self.q_table.loc[s_, a_]

The difference between this equation and Q Learning is the change in this equation:

q_target = r + self.gamma * self.q_table.loc[s_, :].max()

The max() value is present in Q Learning but not in SARSA.
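The two targets can be compared side by side. The sketch below assumes a small hypothetical Q table stored as a pandas DataFrame and a discount factor of 0.9; the function names are illustrative and are not the methods of the repo's classes.

```python
import pandas as pd

GAMMA = 0.9  # discount factor (assumed value)

# Hypothetical Q table with two states and two actions
q_table = pd.DataFrame(
    [[0.0, 0.5], [0.2, 0.8]],
    index=["s0", "s1"],
    columns=["left", "right"],
)

def sarsa_target(r, s_next, a_next):
    """On-policy target: uses the action a' that was actually chosen in s'."""
    return r + GAMMA * q_table.loc[s_next, a_next]

def q_learning_target(r, s_next):
    """Off-policy target: uses the best action in s', whatever is taken next."""
    return r + GAMMA * q_table.loc[s_next, :].max()

# Suppose the agent lands in s1 and its policy happens to pick "left" there
print(sarsa_target(0.0, "s1", "left"))
print(q_learning_target(0.0, "s1"))
```

When the chosen next action is not the best one, the SARSA target is lower than the Q Learning target, because SARSA accounts for what the agent will really do next.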

The logic for implementing a policy using SARSA is shown here: 

# on-policy

class SarsaTable(RL):

    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)

    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update

The learning process is somewhat different than with Q Learning. The logic works 
according to the principle discussed previously. 
We combine the state and action of the current status with the next state and next action. This in turn updates the Q table. This is the way the learning works.

def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()

        # RL chooses an action based on the observation
        action = RL.choose_action(str(observation))

        while True:
            # refresh env
            env.render()

            # RL takes the action and gets the next observation and reward
            observation_, reward, done = env.step(action)

            # RL chooses an action based on the next observation
            action_ = RL.choose_action(str(observation_))

            # RL learns from this transition (s, a, r, s_, a_) ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)

            # swap observation and action
            observation = observation_
            action = action_

            # break the while loop at the end of this episode
            if done:
                break

Here is the code for creating the maze: 

import numpy as np
import time
import sys

if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk

UNIT = 40  # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width
class Maze(tk.Tk, object):

    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('maze')
        # window size is width x height
        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))
        self._build_maze()

    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                                height=MAZE_H * UNIT,
                                width=MAZE_W * UNIT)

        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)

        # create origin
        origin = np.array([20, 20])

        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')

        # hell
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - 15, hell2_center[1] - 15,
            hell2_center[0] + 15, hell2_center[1] + 15,
            fill='black')

        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')

        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')

        # pack all
        self.canvas.pack()

    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')

        # return observation
        return self.canvas.coords(self.rect)

    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:  # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:  # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:  # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:  # left
            if s[0] > UNIT:
                base_action[0] -= UNIT

        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent

        s_ = self.canvas.coords(self.rect)  # next state

        # reward function
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:
            reward = -1
            done = True
        else:
            reward = 0
            done = False

        return s_, reward, done

    def render(self):
        time.sleep(0.1)
        self.update()


The Entire Reinforcement Logic in Python 

When you are implementing the algorithm in Python, the structure looks like the 
following. The content is in the repo. 

import numpy as np
import pandas as pd

class RL(object):

    def __init__(self, action_space, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = action_space  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # add the new state to the Q table as a row of zeros
            # (.loc is used because DataFrame.append was removed in pandas 2.0)
            self.q_table.loc[state] = [0.0] * len(self.actions)

    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.rand() < self.epsilon:
            # choose best action
            state_action = self.q_table.loc[observation, :]
            # shuffle first because some actions have the same value
            state_action = state_action.reindex(np.random.permutation(state_action.index))
            action = state_action.idxmax()
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, *args):
        pass

# off-policy

class QLearningTable(RL):

    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(QLearningTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update

# on-policy

class SarsaTable(RL):

    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        super(SarsaTable, self).__init__(actions, learning_rate, reward_decay, e_greedy)

    def learn(self, s, a, r, s_, a_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, a_]  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update
The learning process in its entirety looks like this in the code (run_this.py, which imports the SarsaTable class from RL_brain.py):

from maze_env import Maze
from RL_brain import SarsaTable

def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()

        # RL chooses an action based on the observation
        action = RL.choose_action(str(observation))

        while True:
            # refresh env
            env.render()

            # RL takes the action and gets the next observation and reward
            observation_, reward, done = env.step(action)

            # RL chooses an action based on the next observation
            action_ = RL.choose_action(str(observation_))

            # RL learns from this transition (s, a, r, s_, a_) ==> Sarsa
            RL.learn(str(observation), action, reward, str(observation_), action_)

            # swap observation and action
            observation = observation_
            action = action_

            # break the while loop at the end of this episode
            if done:
                break

    # end of game
    print('game over')
    env.destroy()

if __name__ == "__main__":
    env = Maze()
    RL = SarsaTable(actions=list(range(env.n_actions)))

    env.after(100, update)
    env.mainloop()
Let's run the program and check it.

You can do this in the Anaconda environment, as shown in Figure 2-57. 



Figure 2-57, Activating the environment 


You then have to consider the SARSA maze, as shown in Figure 2-58. 







(universe) abhi@ubuntu:~$ cd Reinforcement-learning-with-tensorflow
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow$ cd contents
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents$ dir
10_A3C
11_Dyna_Q
12_Proximal_Policy_Optimization
1_command_line_reinforcement_learning
2_Q_Learning_maze
3_Sarsa_maze
4_Sarsa_lambda_maze
5.1_Double_DQN
5.2_Prioritized_Replay_DQN
5.3_Dueling_DQN
5_Deep_Q_Network
6_OpenAI_gym
7_Policy_gradient_softmax
8_Actor_Critic_Advantage
9_Deep_Deterministic_Policy_Gradient_DDPG

Figure 2-58. Considering the SARSA maze


Now you have to call the run_this.py file to get the program running, as shown in Figure 2-59.

(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents$ cd 3_Sarsa_maze
(universe) abhi@ubuntu:~/Reinforcement-learning-with-tensorflow/contents/3_Sarsa_maze$ dir
maze_env.py  __pycache__  RL_brain.py  run_this.py

Figure 2-59. Running run_this.py
To run the program from the terminal, use this command:


python run_this.py 


After running the code, the program will play the maze, as shown in Figure 2-60. 




[Figure: the maze window, with the agent moving on the grid, shown next to the terminal in which python run_this.py was started from the 3_Sarsa_maze directory]

Figure 2-60. The program playing maze

Dynamic Programming in Reinforcement 
Learning 

Problems that are sequential or temporal can be solved using dynamic programming. If you have a complex problem, you have to break it down into subproblems. Dynamic programming is the process of breaking a problem into subproblems, solving those subproblems, and finally combining them to solve the overall problem. It applies when the problem has optimal substructure (the principle of optimality), so that an optimal solution is built from optimal solutions to its subproblems, and when the subproblems recur, so that their solutions can be cached and reused. See Figure 2-61.

[Figure: a problem is broken into subproblems whose solutions are combined into the overall solution]

Figure 2-61. Dynamic problem-solving approach
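One way to see dynamic programming at work in Reinforcement Learning is value iteration, in which the value of each state is built from the cached values of its successor states. The sketch below is an illustration under stated assumptions: it uses a one-dimensional world of six states with a reward of 1 for reaching the last one, and a discount factor of 0.9. Value iteration is named here as an example technique; the chapter itself does not present it.

```python
import numpy as np

N_STATES = 6     # states 0..5; state 5 is terminal
GAMMA = 0.9      # discount factor (assumed value)

def step(s, a):
    """Deterministic model: a = 0 moves left, a = 1 moves right."""
    s_next = max(s - 1, 0) if a == 0 else s + 1
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward

V = np.zeros(N_STATES)       # cached subproblem solutions; V[terminal] stays 0
for sweep in range(100):
    V_new = V.copy()
    for s in range(N_STATES - 1):
        # Principle of optimality: the best value of s is built from the
        # already-computed values of its successor states
        V_new[s] = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in (0, 1)))
    if np.allclose(V_new, V):
        break
    V = V_new

print(V.round(4))
```

The values fall by a factor of 0.9 for each step away from the goal, because each state reuses the solution already found for its right-hand neighbor.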

Conclusion 

This chapter went through different algorithms related to Reinforcement Learning. You 
also saw a simple example of Reinforcement Learning using Python. You then learned 
about SARSA with the help of an example in Python. The chapter ended by discussing 
dynamic programming basics.
CHAPTER 3



This chapter introduces the world of OpenAI and uses it in relation to Reinforcement 
Learning. 

First, we go through environments that are important to Reinforcement Learning. We talk about two supportive platforms that are useful for Reinforcement Learning: Google DeepMind and OpenAI, the latter of which is supported by Elon Musk. The completely open sourced OpenAI is discussed in this chapter and Google DeepMind is discussed in Chapter 6.

The chapter first covers OpenAI basics and then describes the OpenAI Gym and OpenAI Universe environments. Then we cover installing OpenAI Gym and OpenAI Universe on Ubuntu with the Anaconda distribution. Finally, we discuss using OpenAI Gym and OpenAI Universe for the purpose of Reinforcement Learning.

Getting to Know OpenAI 

To start, you need to access the OpenAI web site at https://openai.com/.

The web site is shown in Figure 3-1.

[Figure: the OpenAI home page, with the tagline "Discovering and enacting the path to safe artificial general intelligence."]

Figure 3-1. The OpenAI web site


© Abhishek Nandy and Manisha Biswas 2018 
A. Nandy and M. Biswas, Reinforcement Learning, 
https://doi.org/10.1007/978-1-4842-3285-9_3
The OpenAI web site is full of content and resources that you can use to learn and do research. Let's see schematically how OpenAI Gym and OpenAI Universe are connected. See Figure 3-2.



Figure 3-2, OpenAI Gym and OpenAI Universe 

Figure 3-2 shows how OpenAI Gym and OpenAI Universe are connected, by using 
their icons. 

The OpenAI Gym page of the web site is shown in Figure 3-3. 




[Figure: the OpenAI Gym page, which reads "A toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go."]

Figure 3-3. OpenAI Gym web site


OpenAI Gym is a toolkit that helps you run simulation games and scenarios in which to develop and apply Reinforcement Learning algorithms. It supports teaching agents many activities, such as playing games, walking, etc.

The OpenAI Universe web site is shown in Figure 3-4.




[Figure: the OpenAI Universe page, which reads "Learn. Grow. Get Smarter. Solve successively harder environments to develop general problem solving ability."]

Figure 3-4. The OpenAI Universe web site

OpenAI Universe is a software platform that measures and trains an AI's general intelligence across different kinds of games and applications.

Installing OpenAI Gym and OpenAI Universe 

In this section, you learn how to install OpenAI Gym and OpenAI Universe on a machine running Ubuntu 16.04.

Go into the Anaconda environment to install OpenAI Gym from GitHub. See 
Figure 3-5. 


(universe) openai@ubuntu:~$ cd ~
(universe) openai@ubuntu:~$ git clone https://github.com/openai/gym.git
Cloning into 'gym'...

Figure 3-5. Cloning OpenAI Gym


You can clone and install OpenAI Gym from GitHub using these commands:

$ source activate universe 
(universe) $ cd ~ 

(universe) $ git clone https://github.com/openai/gym.git 

(universe) $ cd gym 

(universe) $ pip install -e '.[all]' 
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Now install OpenAI Universe as follows: 


(universe) 

(universe) 

(universe) 

(universe) 


$ cd ~ 

$ git clone https://github.com/openai/universe.git 
$ cd universe 
$ pip install -e 


The packages are being installed. Figure 3-6 shows the cloning process for OpenAI 
Universe. 
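
Once both installs finish, one quick way to confirm that the packages are visible to
the active environment is to ask Python whether it can locate them. The following is
a minimal sketch; the helper name is_installed is an illustrative assumption and is
not part of Gym or Universe.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Run this inside the activated "universe" environment.
    for name in ("gym", "universe"):
        status = "found" if is_installed(name) else "missing"
        print(name, status)
```

If either package is reported as missing, the usual first check is that the universe
environment was activated before running pip.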


[Terminal screenshot: pip output while installing Gym, showing requirements that are
already satisfied and further dependencies being collected and downloaded.]


Figure 3-6, Cloning OpenAI Universe 


All the required files are downloaded during this process, as shown in Figure 3-7.


[Terminal screenshot: pip builds wheels for the collected packages, installs them,
and finishes by setting up Gym in develop mode.]


Figure 3-7, Important steps of the installation process


The installation process continues, as shown in Figure 3-8.

Figure 3-8, More steps of the installation process

In the next section, you learn how to start working in the OpenAI Gym and OpenAI
environment.

Working with OpenAI Gym and OpenAI 

The OpenAI cycle for a sample process is shown in Figure 3-9. 


[Flowchart: the cycle starts with "Import gym".]



Figure 3-9, The Basic OpenAI Gym structure 




The process works this way. We are dealing with a simple Gym project. The language 
of choice here is Python, but we are more focused on the logic of how an environment is 
being utilized. 

1. We import the Gym library. 

2. We create an instance of the simulation to perform using the 
make function. 

3. We reset the simulation so that the condition that we are going 
to apply can be realized. 

4. We loop, rendering the environment and taking an action at each step.

The output is a simulated result of the environment using OpenAI Reinforcement
Learning techniques.

The program using Python is shown here, whereby we are using the cart-pole 
simulation example: 

import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action

The program that we created runs from the terminal; we can also run it in a Jupyter
notebook, an interactive environment in which you can run Python code cell by cell.
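
The cart-pole loop discards what step returns. In the Gym releases of this period,
step returns an observation, a reward, a done flag, and an info dictionary, and an
episode should be reset once done is true. The sketch below shows that cycle without
needing Gym installed; CountdownEnv and run_random_agent are hypothetical stand-ins
written only for illustration, and only their reset/step shape mirrors Gym.

```python
import random


class CountdownEnv:
    """Hypothetical stand-in that mimics the reset/step shape of a Gym environment."""

    def __init__(self, episode_length=5):
        self.episode_length = episode_length
        self.steps_left = episode_length

    def reset(self):
        self.steps_left = self.episode_length
        return self.steps_left  # initial observation

    def step(self, action):
        self.steps_left -= 1
        reward = 1.0                      # one point per step survived
        done = self.steps_left == 0       # the episode ends when the counter hits zero
        return self.steps_left, reward, done, {}


def run_random_agent(env, total_steps, seed=0):
    """Take random actions, resetting whenever an episode ends."""
    rng = random.Random(seed)
    episodes_finished = 0
    total_reward = 0.0
    env.reset()
    for _ in range(total_steps):
        action = rng.choice([0, 1])       # stands in for env.action_space.sample()
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:                          # start a fresh episode
            episodes_finished += 1
            env.reset()
    return episodes_finished, total_reward


if __name__ == "__main__":
    print(run_random_agent(CountdownEnv(episode_length=5), total_steps=12))
```

With a real Gym environment, the same loop shape applies: replace CountdownEnv with
the object returned by gym.make and sample actions from env.action_space.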

To use the properties or the file structure of OpenAI, you need to be in the universe 
directory, as shown in Figure 3-10. 


[Terminal screenshot: source activate universe, followed by cd universe.]


Figure 3-10, Inside the universe directory

To work with the Gym components, you need to get inside the gym directory, as
shown in Figure 3-11.



Figure 3-11, Inside the gym directory 


You then need to open the Jupyter notebook. Enter this command from the terminal
to open it (see Figure 3-12):

jupyter notebook

[Terminal screenshot: source activate universe, cd gym, then jupyter notebook.]





Figure 3-12, Using the Jupyter notebook


When you issue the command, the Jupyter notebook engine loads its essential
components, as shown in Figure 3-13.



Figure 3-13, The essential components of Jupyter notebooks

Once the Jupyter notebook is loaded, you will see that the interface has an option for
working with Python files. The Python distribution you have is shown in the
interface. Figure 3-14 shows that Python 3 is installed in this case.


[Screenshot: the Jupyter "New" menu, listing Python 3 under Notebook, along with
Text File, Folder, and Terminal.]

Figure 3-14, Opening a new Python file

You can now start working with the Gym interface and start importing Gym libraries,
as shown in Figure 3-15.


[Notebook screenshot:]

In [2]: env = gym.make('CartPole-v0')

[2017-09-07 12:45:13,631] Making new env: CartPole-v0


Figure 3-15, Working with Gym inside the jupyter notebook 


The process continues until the program flow is completed. Figure 3-16 shows the
process flow.

[Notebook screenshot:]

In [1]: import gym

In [2]: env = gym.make('CartPole-v0')

[2017-09-07 12:45:13,631] Making new env: CartPole-v0

In [ ]: for _ in range(1000):
            env.render()
            env.step(env.action_space.sample())


Figure 3-16, The flow of the program


After being reset, the environment shows an array, as shown in Figure 3-17. 



[Notebook screenshot:]

In [1]: import gym

In [2]: env = gym.make('CartPole-v0')

[2017-09-11 08:22:50,950] Making new env: CartPole-v0

In [3]: env.reset()

Out[3]: array([-0.01605733, 0.01950176, -0.02306482, 0.02559062])



Figure 3-17, An array is being created

Figure 3-18 shows the simulation. The cart-pole shifts by a margin that's reflected by
the array's values.



Figure 3-18, The simulation in action
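
For CartPole, the four numbers in the array are commonly documented as cart position,
cart velocity, pole angle, and pole angular velocity. The following small sketch
labels them, using values as read from Figure 3-17; the CartPoleObservation name is
an illustrative assumption and not something Gym provides.

```python
from collections import namedtuple

# Field order follows the commonly documented CartPole observation layout.
CartPoleObservation = namedtuple(
    "CartPoleObservation",
    ["cart_position", "cart_velocity", "pole_angle", "pole_angular_velocity"],
)

# Values as read from the array shown in Figure 3-17.
raw_observation = [-0.01605733, 0.01950176, -0.02306482, 0.02559062]
observation = CartPoleObservation(*raw_observation)

# Named access is easier to read than indexing into the raw array.
print("cart position:", observation.cart_position)
print("pole angle:", observation.pole_angle)
```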


More Simulations

This section shows you how to try different simulations. There are many different
environment types in OpenAI. One of them is the algorithmic type, discussed next.

The algorithmic environments cover a variety of tasks. Run this code to include the
environment in the Jupyter notebook (see Figure 3-19):

import gym

env = gym.make('Copy-v0')
env.reset()
env.render()


[Notebook screenshot:]

In [1]: import gym

In [2]: env = gym.make('Copy-v0')

[2017-09-12 21:39:35,789] Making new env: Copy-v0

In [3]: env.reset()

Out[3]: 1

In [ ]: env.render()


Figure 3-19, Including the environment in the Jupyter notebook

The output looks like Figure 3-20. The goal of this simulation is to copy
symbols from an input sequence.



[Notebook screenshot: the output of env.render(), showing the total length of the
input instance (2), the observation tape, the output tape, and the targets (CC).]


Figure 3-20, The output after running the render function


This section uses an example of classic arcade games. First, open the required 
Anaconda environment using the following command: 

source activate universe 

Then go to the appropriate directory, say gym: 

cd gym 

From the terminal, start the Jupyter notebook using this command:

jupyter notebook

This enables you to start working with the Python option. Figure 3-21 shows the
process using the classic arcade games.



[Notebook screenshot:]

In [1]: import gym

In [2]: env = gym.make('SpaceInvaders-v0')

Making new env: SpaceInvaders-v0

Figure 3-21, Using classic arcade games

After using env.reset(), an array is generated, as shown in Figure 3-22.



Figure 3-22, The array is being created

If you use env.render(), you'll generate the output shown in Figure 3-23.


[Notebook screenshot: the tail of the uint8 array returned by env.reset(), made up of
rows of [80, 89, 22] values, followed by the rendered Space Invaders game window after
env.render() is called.]


Figure 3-23, Rendering the output 


This example is simply simulating different kinds of game environments and setting 
them up for Reinforcement Learning. 

Here is the code to simulate the Space Invaders game: 

import gym

env = gym.make('SpaceInvaders-v0')
env.reset()
env.render()

In the next section, you will learn how to work with OpenAI Universe. 

OpenAI Universe 

In this example, you will be using the Jupyter notebook to simulate a game environment
and then will apply Reinforcement Learning to it. Go to the universe directory and start
the Jupyter notebook.

import gym
import universe  # register the universe environments

env = gym.make('flashgames.DuskDrive-v0')
env.configure(remotes=1)  # automatically creates a local docker container
observation_n = env.reset()

while True:
    action_n = [[('KeyEvent', 'ArrowUp', True)] for ob in observation_n]  # your agent here
    observation_n, reward_n, done_n, info = env.step(action_n)
    env.render()

Figure 3-24 shows the code needed to set up the environment for the DuskDrive 
game. 


[Notebook screenshot: after import gym and import universe,
gym.make('flashgames.DuskDrive-v0') and env.configure(remotes=1) are run. The log
shows Universe creating a container from the quay.io/openai/universe.flashgames
image and pulling the image because it is not present locally.]


Figure 3-24, Setting up the environment for the DuskDrive game 


Universe now fetches the image and starts it as a remote. The game runs there and is
played with the help of an agent. See Figure 3-25.

[Screenshot: the flashgames.DuskDrive-v0 browser window, with the notice "This game
is being played by an AI" and the status line "Pressed keys: ArrowUp".]


Figure 3-25, The game played by the agent 

First, you import the gym library, which is the base on which OpenAI Universe is
built. You also must import universe, which registers all the Universe environments:

import gym
import universe  # register the universe environments

After that, you create an environment for loading the Flash game that will be 
simulated (in this case, the DuskDrive game). 

env = gym.make('flashgames.DuskDrive-v0')

You call configure, which creates a dockerized environment for running the
simulation locally.

env.configure(remotes=1)

You then call env.reset() to instantiate the proper simulation environment
asynchronously:

observation_n = env.reset()

You then define a KeyEvent action with the ArrowUp key to move the car in the simulated
environment:

action_n = [[('KeyEvent', 'ArrowUp', True)] for ob in observation_n] 

To get rewards and to check the status of the episodes, you use the following code 
and render accordingly. 

observation_n, reward_n, done_n, info = env.step(action_n) 
env.render()
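
The _n suffix indicates that each value is a list with one entry per remote
environment. The sketch below builds one action per remote and sums the rewards
returned by a step. It runs without Universe installed. The helper names are
hypothetical, and sending an empty event list while an observation is None (which
Universe's documentation associates with a remote that is still resetting) is an
assumption for illustration, not a requirement of the API.

```python
def make_actions(observation_n, key="ArrowUp"):
    """Build one action (a list of key events) per remote environment."""
    action_n = []
    for observation in observation_n:
        if observation is None:
            # Assumption: send no events while this remote is not ready yet.
            action_n.append([])
        else:
            # Same event tuple as in the chapter: press and hold the given key.
            action_n.append([("KeyEvent", key, True)])
    return action_n


def total_reward(reward_n):
    """Sum the per-remote rewards returned by a single step."""
    return sum(reward_n)


if __name__ == "__main__":
    # Two pretend remotes: the first is still resetting, the second is running.
    print(make_actions([None, {"vision": "frame"}]))
    print(total_reward([0.0, 1.5]))
```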


Conclusion 

This chapter explained the details of OpenAI. First, it described OpenAI in general and 
then described OpenAI Gym and OpenAI Universe. 

We touched on installing OpenAI Gym and OpenAI Universe and then started 
coding for them using the Python language. Finally, we looked at some examples of both 
OpenAI Gym and OpenAI Universe. 
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CHAPTER 4 


Applying Python to 
Reinforcement Learning 


This chapter explores Reinforcement Learning in terms of Python. We start with
Q learning in Python and then cover a more in-depth analysis of Reinforcement
Learning. Then we describe Swarm intelligence in Python, with an introduction to
what exactly Swarm intelligence is. The chapter also covers the Markov decision
process (MDP) toolbox.

Finally, you will be implementing a Game AI and will apply Reinforcement Learning 
to it. The chapter will be a good experience, so let's begin! 

Q Learning with Python 

Let's start with a maze problem. The object of the game is to reach the yellow circle while 
avoiding the black squares. Figure 4-1 shows the maze. We use the numpy library in this 
example. 


© Abhishek Nandy and Manisha Biswas 2018 
A. Nandy and M. Biswas, Reinforcement Learning, 
https://doi.org/10.1007/978-1-4842-3285-9_4
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Figure 4-1, The maze that demonstrates Q learning 

We have to choose an action based on the Q table, which is why we have the function
called choose_action. When we want to move from one state to another, we apply the
decision-making process in the choose_action method as follows.

def choose_action(self, observation):

The learn function takes a transition (state, action, reward, and next state) and
updates the Q table accordingly.

def learn(self, s, a, r, s_):

The check_state_exist function allows us to check whether a state exists in the
Q table and to append it if it does not.

def check_state_exist(self, state):

The functions we have discussed belong to RL_brain, which is the basis of the
project. The Q learning rules are applied in the update loop in the run_this.py file.
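
The decision rule behind choose_action can be sketched in isolation. The version
below is a simplified illustration that uses a plain dictionary in place of the
pandas table in RL_brain; as in this chapter, epsilon is the probability of taking
the best-valued action, and the remaining probability is spent exploring. The
function signature and the dictionary layout are assumptions made for this sketch.

```python
import random


def choose_action(q_table, state, actions, epsilon=0.9, rng=random):
    """Epsilon-greedy selection: exploit with probability epsilon, else explore."""
    # Unseen states start with a zero value for every action.
    q_values = q_table.setdefault(state, {a: 0.0 for a in actions})
    if rng.random() < epsilon:
        best = max(q_values.values())
        best_actions = [a for a in actions if q_values[a] == best]
        return rng.choice(best_actions)   # break ties at random
    return rng.choice(actions)            # explore: any action, uniformly


if __name__ == "__main__":
    table = {"start": {0: 0.0, 1: 2.0, 2: 1.0}}
    # With epsilon=1.0 the best-valued action is always taken.
    print(choose_action(table, "start", [0, 1, 2], epsilon=1.0))
```

Breaking ties at random matters early in training, when every value is still zero
and always picking the first action would bias the exploration.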
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The Maze Environment Python File 

The maze environment Python file, shown here, defines everything needed for making
moves. We declare the rewards as well as the ability to take the next step.


"""
Reinforcement learning maze example.

Red rectangle:          explorer.
Black rectangles:       hells       [reward = -1].
Yellow bin circle:      paradise    [reward = +1].
All other states:       ground      [reward = 0].

This script is the environment part of this example. The RL is in RL_brain.py.

View more on my tutorial page: https://morvanzhou.github.io/tutorials/
"""


import numpy as np
import time
import sys

if sys.version_info.major == 2:
    import Tkinter as tk
else:
    import tkinter as tk

UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width


class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        self.action_space = ['u', 'd', 'l', 'r']
        self.n_actions = len(self.action_space)
        self.title('maze')
        self.geometry('{0}x{1}'.format(MAZE_H * UNIT, MAZE_H * UNIT))
        self._build_maze()

    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                                height=MAZE_H * UNIT,
                                width=MAZE_W * UNIT)

        # create grids
        for c in range(0, MAZE_W * UNIT, UNIT):
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(0, MAZE_H * UNIT, UNIT):
            x0, y0, x1, y1 = 0, r, MAZE_H * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)

        # create origin
        origin = np.array([20, 20])

        # hell
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - 15, hell1_center[1] - 15,
            hell1_center[0] + 15, hell1_center[1] + 15,
            fill='black')

        # hell
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - 15, hell2_center[1] - 15,
            hell2_center[0] + 15, hell2_center[1] + 15,
            fill='black')

        # create oval
        oval_center = origin + UNIT * 2
        self.oval = self.canvas.create_oval(
            oval_center[0] - 15, oval_center[1] - 15,
            oval_center[0] + 15, oval_center[1] + 15,
            fill='yellow')

        # create red rect
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')

        # pack all
        self.canvas.pack()

    def reset(self):
        self.update()
        time.sleep(0.5)
        self.canvas.delete(self.rect)
        origin = np.array([20, 20])
        self.rect = self.canvas.create_rectangle(
            origin[0] - 15, origin[1] - 15,
            origin[0] + 15, origin[1] + 15,
            fill='red')

        # return observation
        return self.canvas.coords(self.rect)

    def step(self, action):
        s = self.canvas.coords(self.rect)
        base_action = np.array([0, 0])
        if action == 0:    # up
            if s[1] > UNIT:
                base_action[1] -= UNIT
        elif action == 1:  # down
            if s[1] < (MAZE_H - 1) * UNIT:
                base_action[1] += UNIT
        elif action == 2:  # right
            if s[0] < (MAZE_W - 1) * UNIT:
                base_action[0] += UNIT
        elif action == 3:  # left
            if s[0] > UNIT:
                base_action[0] -= UNIT

        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent

        s_ = self.canvas.coords(self.rect)  # next state

        # reward function
        if s_ == self.canvas.coords(self.oval):
            reward = 1
            done = True
        elif s_ in [self.canvas.coords(self.hell1), self.canvas.coords(self.hell2)]:
            reward = -1
            done = True
        else:
            reward = 0
            done = False

        return s_, reward, done

    def render(self):
        time.sleep(0.1)
        self.update()


def update():
    for t in range(10):
        s = env.reset()
        while True:
            env.render()
            a = 1
            s, r, done = env.step(a)
            if done:
                break


if __name__ == '__main__':
    env = Maze()
    env.after(100, update)
    env.mainloop()
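
The movement and reward rules in step can also be read without the Tkinter drawing
code. The sketch below restates them on (column, row) grid cells, with the hells and
the goal placed where _build_maze draws them. It is a display-free illustration, not
part of the original project, and the names grid_step, HELLS, GOAL, and MOVES are
assumptions.

```python
GRID_W, GRID_H = 4, 4                     # same size as MAZE_W and MAZE_H
HELLS = {(2, 1), (1, 2)}                  # the two black squares
GOAL = (2, 2)                             # the yellow circle
# Action numbering follows the comments in step: up, down, right, left.
MOVES = {0: (0, -1), 1: (0, 1), 2: (1, 0), 3: (-1, 0)}


def grid_step(cell, action):
    """Return (next_cell, reward, done) for one move on the grid."""
    col, row = cell
    d_col, d_row = MOVES[action]
    # Moves that would leave the grid keep the agent in place.
    new_col = min(max(col + d_col, 0), GRID_W - 1)
    new_row = min(max(row + d_row, 0), GRID_H - 1)
    next_cell = (new_col, new_row)
    if next_cell == GOAL:
        return next_cell, 1, True
    if next_cell in HELLS:
        return next_cell, -1, True
    return next_cell, 0, False


if __name__ == "__main__":
    print(grid_step((0, 0), 0))   # moving up from the top row stays in place
    print(grid_step((1, 1), 2))   # moving right from (1, 1) lands on a hell
```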


The RL_Brain Python File 

Now for the RL_brain Python file. We define the Q learning table structure that is
filled in while moving from one state to another. In the QLearningTable class, we
structure the way the agent learns the maze. We also declare the hyperparameters for
learning, including the rate at which the program learns, in the next chunk of code:

import numpy as np
import pandas as pd


class QLearningTable:
    # Note: this listing uses the pandas API of the time. The .ix indexer and
    # DataFrame.append were removed in later pandas releases (.loc and
    # pd.concat are the usual replacements).
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions)

    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection
        if np.random.uniform() < self.epsilon:
            # choose best action
            state_action = self.q_table.ix[observation, :]
            # shuffle first, because some actions have the same value
            state_action = state_action.reindex(np.random.permutation(state_action.index))
            action = state_action.argmax()
        else:
            # choose random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.ix[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.ix[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.ix[s, a] += self.lr * (q_target - q_predict)  # update

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append new state to q table
            self.q_table = self.q_table.append(
                pd.Series(
                    [0] * len(self.actions),
                    index=self.q_table.columns,
                    name=state,
                )
            )
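
The update performed in learn is the standard Q learning rule: the stored value moves
a fraction lr of the way toward the target r + gamma * max Q(s_, .). The sketch below
isolates that arithmetic with the same default hyperparameters (lr = 0.01 and
gamma = 0.9). The function name q_update and its signature are illustrative
assumptions.

```python
def q_update(q_sa, reward, next_q_values, lr=0.01, gamma=0.9, terminal=False):
    """Return the new Q(s, a) after one Q learning update."""
    if terminal:
        q_target = reward                              # no future value after the end
    else:
        q_target = reward + gamma * max(next_q_values)
    return q_sa + lr * (q_target - q_sa)


if __name__ == "__main__":
    # Non-terminal step: target = 0 + 0.9 * 0.5 = 0.45, so Q moves from 0 to about 0.0045.
    print(q_update(0.0, 0, [0.0, 0.5, 0.2, 0.0]))
    # Terminal step with reward +1: Q moves from 0 to about 0.01.
    print(q_update(0.0, 1, [], terminal=True))
```

With a learning rate this small, a reward has to be seen many times before it
noticeably changes the table, which is one reason the training loop runs for many
episodes.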


Updating the Function 

This code segment declares a function that receives updates on the movement in the 
maze from one state to another. It also gives out rewards when the player transitions from 
one state to another. 

from maze_env import Maze
from RL_brain import QLearningTable


def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()

        while True:
            # fresh env
            env.render()

            # RL choose action based on observation
            action = RL.choose_action(str(observation))

            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)

            # RL learn from this transition
            RL.learn(str(observation), action, reward, str(observation_))

            # swap observation
            observation = observation_

            # break while loop when end of this episode
            if done:
                break

    # end of game
    print('game over')
    env.destroy()


if __name__ == "__main__":
    env = Maze()
    RL = QLearningTable(actions=list(range(env.n_actions)))

    env.after(100, update)
    env.mainloop()

If you go inside the folder, you'll see the run_this.py file. Run it to get the
output, as shown in Figure 4-2.


[Terminal screenshot: inside the
Reinforcement-learning-with-tensorflow/contents/2_Q_Learning_maze directory, which
contains maze_env.py, RL_brain.py, and run_this.py.]


Figure 4-2, Running the file 
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Figure 4-3 shows the code running. 









[Terminal screenshot: python run_this.py is run from the 2_Q_Learning_maze directory
and prints "game over"; on a second run, the maze window opens next to the terminal.]


Figure 4-3. The maze file being run 

Using the MDPToolbox in Python 

The MDP toolbox provides classes and functions for the resolution of discrete time 
Markov decision processes. The list of algorithms that have been implemented includes 
backwards induction, linear programming, policy iteration, Q learning, and value 
iteration along with several variations. 
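
To make one of these algorithms concrete, here is a minimal value iteration sketch in
plain Python on a made-up two-state MDP. It does not use the toolbox's API; the
transition table, the rewards, and the function names are assumptions chosen only for
illustration.

```python
def action_values(transitions, values, state, gamma):
    """Expected return of each action in a state, given current value estimates."""
    return [
        sum(prob * (reward + gamma * values[next_state])
            for prob, next_state, reward in outcomes)
        for outcomes in transitions[state]
    ]


def value_iteration(transitions, gamma=0.9, tolerance=1e-8):
    """Return (values, policy) for a small MDP given as nested lists."""
    n_states = len(transitions)
    values = [0.0] * n_states
    while True:
        new_values = [max(action_values(transitions, values, s, gamma))
                      for s in range(n_states)]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tolerance:             # stop once the estimates settle
            break
    policy = []
    for s in range(n_states):
        q = action_values(transitions, values, s, gamma)
        policy.append(q.index(max(q)))    # greedy action for each state
    return values, policy


# transitions[state][action] = list of (probability, next_state, reward)
TOY_MDP = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],   # state 0: action 0 stays, action 1 moves
    [[(1.0, 1, 2.0)], [(1.0, 0, 0.0)]],   # state 1: action 0 stays, action 1 moves
]

if __name__ == "__main__":
    values, policy = value_iteration(TOY_MDP)
    print(values, policy)
```

In this toy problem the best plan is to move to state 1 and stay there, so the values
converge to roughly 19 and 20 with a discount of 0.9.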

The following are the features of the MDP toolbox (see Figure 4-4): 

• Eight MDP algorithms 

• Fast array manipulation using NumPy 

• Full sparse matrix support using SciPy's sparse package

• Optional linear programming support using cvxopt

Figure 4-4, MDP toolbox features

Next, you see how to install and configure MDP toolbox for Python. First, switch to
the Anaconda environment, as shown in Figure 4-5.



Figure 4-5, Activating the Anaconda environment

Now install the dependencies using this command (see Figure 4-6):

sudo apt-get install python3-numpy python3-scipy liblapack-dev libatlas-base-dev libgsl0-dev fftw-dev libglpk-dev libdsdp-dev


[Terminal screenshot: source activate universe, followed by the apt-get install
command for the dependencies.]





Figure 4-6, Installing the dependencies 


When it asks you whether it should install the dependencies, choose yes, as shown in
Figure 4-7.

[Terminal screenshot: apt-get lists the 37 new packages to be installed and prompts
"Do you want to continue? [Y/n]".]


Figure 4-7, Choose yes to proceed


All the dependencies are then installed, as shown in Figure 4-8.

[Terminal screenshot: apt-get sets up gfortran, libblas, liblapack,
python3-decorator, python3-numpy, and python3-scipy, then returns to the prompt.]


Figure 4-8, The dependencies are installed

Now you can go ahead and install the MDP toolbox, as shown in Figure 4-9.

[Terminal screenshot: after the dependencies are set up, the following command is
entered:]

pip install "pymdptoolbox[LP]"






Figure 4-9, Installing the MDP toolbox 
The important packages are being installed, as shown in Figure 4-10.

Figure 4-10. Installing the important packages
If everything works as expected, you'll get all the packages installed, as shown in
Figure 4-11.

Successfully installed cvxopt-1.1.9 pymdptoolbox-4.0b3

Figure 4-11. All the packages have been installed
Now you need to clone the repo from GitHub (see Figure 4-12):

git clone https://github.com/sawcordwell/pymdptoolbox.git

Figure 4-12. Cloning the repo
Switch to the pymdptoolbox folder to see the details shown in Figure 4-13.

Figure 4-13. Getting inside the folder


You now need to switch to Python mode, as shown in Figure 4-14.

(universe) abhi@ubuntu:~$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Figure 4-14. Inside Python mode
We will now use an example to see how the MDP toolbox works. First, import the
MDP example, as shown in Figure 4-15.

>>> import mdptoolbox.example

Figure 4-15. Importing the modules


A Markov problem assumes that future states depend only on the current state, not
on the events that occurred before it. We will set up an example Markov problem using
a discount value of 0.8. To use the built-in examples in the MDP toolbox, you need to
import mdptoolbox.example and solve the problem using a value iteration algorithm. Then
you'll need to check the optimal policy. The optimal policy is a function that maps each
state to the action that yields the maximum expected discounted reward.

You can check the policy with the vi.policy command, as shown in Figure 4-16.


Figure 4-16. Doing operations


The output for the policy is (0, 0, 0): the optimal action is action 0 in each of the
three states. The discounted value of each state under this policy is stored in vi.V.

Here is the full program:

import mdptoolbox.example

P, R = mdptoolbox.example.forest()

vi = mdptoolbox.mdp.ValueIteration(P, R, 0.8)

vi.run()

vi.policy # result is (0, 0, 0)
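To see what value iteration does under the hood, here is a minimal NumPy sketch that does not use the toolbox. The transition and reward arrays in the hypothetical helper forest_mdp are an assumption: they are meant to mirror the three-state forest example with its two actions (0 for wait, 1 for cut), and the function names are illustrative only.

```python
import numpy as np

def forest_mdp(p=0.1, r1=4.0, r2=2.0):
    """Arrays assumed to mirror the 3-state forest example (illustrative)."""
    # P[a, s, s2]: probability of moving from s to s2 under action a
    P = np.array([
        [[p, 1 - p, 0.0], [p, 0.0, 1 - p], [p, 0.0, 1 - p]],  # action 0: wait
        [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # action 1: cut
    ])
    # R[s, a]: immediate reward for taking action a in state s
    R = np.array([[0.0, 0.0], [0.0, 1.0], [r1, r2]])
    return P, R

def value_iteration(P, R, discount, tol=1e-10, max_iter=10000):
    V = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + discount * sum over s2 of P[a, s, s2] * V[s2]
        Q = R + discount * (P @ V).T
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    # The greedy policy picks the best action in every state
    Q = R + discount * (P @ V).T
    return V, tuple(int(a) for a in Q.argmax(axis=1))

P, R = forest_mdp()
V, policy = value_iteration(P, R, 0.8)
print(policy)  # (0, 0, 0)
```

Each sweep replaces the value of a state with the best one-step lookahead, and the loop stops once the values no longer change by more than the tolerance.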

Let's consider another example. First you need to import the toolbox and the toolbox
examples. Importing mdptoolbox.example brings in the built-in examples that ship with
the MDP toolbox (see Figure 4-17).

import mdptoolbox, mdptoolbox.example

Figure 4-17. Another example of MDP


This example solves the forest problem as a finite-horizon MDP with a discount value
of 0.9 over three stages. In verbose mode, the solver displays the current stage and the
policy transpose. Each row of fh.V and fh.policy corresponds to a state and each column
to a stage.

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.forest()
>>> fh = mdptoolbox.mdp.FiniteHorizon(P, R, 0.9, 3)
>>> fh.run()
>>> fh.V
array([[ 2.6973,  0.81  ,  0.    ,  0.    ],
       [ 5.9373,  3.24  ,  1.    ,  0.    ],
       [ 9.9373,  7.24  ,  4.    ,  0.    ]])
>>> fh.policy
array([[0, 0, 0],
       [0, 0, 1],
       [0, 0, 0]])
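The numbers above can be reproduced by backward induction, which is the idea behind a finite-horizon solver. The following is a minimal sketch, not the toolbox's own code; the P and R arrays are an assumed encoding of the forest example (action 0 is wait, action 1 is cut), and finite_horizon is a hypothetical helper name.

```python
import numpy as np

# Assumed forest matrices: P[a, s, s2] and R[s, a]
P = np.array([
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.1, 0.0, 0.9]],  # action 0: wait
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # action 1: cut
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]])

def finite_horizon(P, R, discount, n_stages):
    n_states = R.shape[0]
    # Last column holds the terminal values, which are zero
    V = np.zeros((n_states, n_stages + 1))
    policy = np.zeros((n_states, n_stages), dtype=int)
    # Work backward from the final stage to the first one
    for t in range(n_stages - 1, -1, -1):
        Q = R + discount * (P @ V[:, t + 1]).T
        V[:, t] = Q.max(axis=1)
        policy[:, t] = Q.argmax(axis=1)
    return V, policy

V, policy = finite_horizon(P, R, 0.9, 3)
print(np.round(V, 4))
print(policy)
```

In the last stage there is no future to wait for, which is why the policy cuts in the middle state only there (the single 1 in the policy array).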

The next example uses policy iteration. In verbose mode, each iteration displays the
number of different actions between policy n-1 and n (see Figure 4-18).

Figure 4-18. Policy between n-1 and n


We are using the built-in examples of the MDP toolbox, where we solve a discounted
MDP using policy iteration. The first problem is randomly generated by rand(10, 3),
which builds an MDP with 10 states and 3 actions, and the second is the forest example,
whose values are fixed by the problem definition.

We solve each MDP by applying policy iteration in this example:

>>> import mdptoolbox, mdptoolbox.example
>>> P, R = mdptoolbox.example.rand(10, 3)
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()
>>> P, R = mdptoolbox.example.forest()
>>> pi = mdptoolbox.mdp.PolicyIteration(P, R, 0.9)
>>> pi.run()
>>> expected = (26.244000000000014, 29.484000000000016, 33.484000000000016)
>>> all(expected[k] - pi.V[k] < 1e-12 for k in range(len(expected)))
True
>>> pi.policy
(0, 0, 0)
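Policy iteration alternates between evaluating the current policy and improving it greedily. The sketch below is an illustration under assumptions rather than the toolbox's implementation: the forest arrays are the assumed encoding used for this example, the starting policy ("always cut") is an arbitrary choice, and policy_iteration is a hypothetical helper that also records how many actions changed in each round.

```python
import numpy as np

# Assumed forest matrices: P[a, s, s2] and R[s, a]
P = np.array([
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.1, 0.0, 0.9]],  # action 0: wait
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # action 1: cut
])
R = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 2.0]])

def policy_iteration(P, R, discount, max_iter=100):
    n_states = R.shape[0]
    states = np.arange(n_states)
    policy = np.ones(n_states, dtype=int)  # start from "always cut"
    changes = []
    for _ in range(max_iter):
        # Policy evaluation: solve (I - discount * P_pi) V = R_pi exactly
        P_pi = P[policy, states, :]
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(n_states) - discount * P_pi, R_pi)
        # Policy improvement: act greedily with respect to V
        Q = R + discount * (P @ V).T
        new_policy = Q.argmax(axis=1)
        n_changed = int(np.sum(new_policy != policy))
        changes.append(n_changed)
        if n_changed == 0:
            break
        policy = new_policy
    return V, tuple(int(a) for a in policy), changes

V, policy, changes = policy_iteration(P, R, 0.9)
print(policy, changes)
```

The list of change counts corresponds to the per-iteration figure that verbose mode reports: the loop ends as soon as an improvement step changes no action.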
Understanding Swarm Intelligence

Swarm intelligence is an important part of AI. It is the collective behavior of a
decentralized, self-organized system, whether it be natural or artificial.

Swarm intelligence typically consists of a population of simple agents or boids
(artificial life programs) interacting locally with one another and with their environment,
as illustrated in Figure 4-19.

Figure 4-19. Swarm intelligence interactions


Applications of Swarm Intelligence 

Figure 4-20 shows some applications of swarm intelligence. 
Ant-Based Routing

Applying swarm intelligence to telecommunication networks is called ant-based routing.
The idea of ant-based routing is based on RL: a particular network packet, which can be
called the ant, makes a lot of forward and backward movements along the network. This
results in the ants flooding the entire network.


Crowd Simulations 

In the movies, crowd simulations are done with the help of swarm optimization. 

Human Swarming 

The concept of human swarming is based on the collective usage of different minds to
predict an answer. It's when the brains of different human beings all attempt to find a
particular solution to a complex problem. Using collective brains in the form of human
swarming yields more accurate results.

Swarm Grammars

Swarm grammars are particular characteristics that act as different swarms working
together to get varied results. The results can be similar to art or architecture.


Swarmic Art 

Combining different characteristics of swarm behaviors between different species of birds 
and fish can lead to swarmic art that shows patterns in swarm behavior. 

Before we cover swarm intelligence in more detail, we touch on the Rastrigin 
function. Swarm optimization is based on different functions, one of which is the 
Rastrigin function, so you need to understand how it works. 


The Rastrigin Function 

In mathematical optimization problems, the Rastrigin function is a nonconvex function 
used as a performance test problem for optimization algorithms. 

The formula is shown in Figure 4-21 and Figure 4-22 shows its typical output. 

On an n-dimensional domain it is defined by:

$$f(\mathbf{x}) = An + \sum_{i=1}^{n} \left[ x_i^2 - A\cos(2\pi x_i) \right]$$

where $A = 10$ and $x_i \in [-5.12, 5.12]$. It has a global minimum at $\mathbf{x} = \mathbf{0}$ where $f(\mathbf{x}) = 0$.

Figure 4-21. Depiction of the Rastrigin function

Figure 4-22. Rastrigin function output
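As a quick check of the formula, here is a minimal NumPy sketch that evaluates the Rastrigin function for a point of any dimension. The function name and argument layout are illustrative choices, not part of any library.

```python
import numpy as np

def rastrigin(x, A=10.0):
    """Rastrigin function: A*n + sum(x_i^2 - A*cos(2*pi*x_i))."""
    x = np.asarray(x, dtype=float)
    return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))

print(rastrigin([0.0, 0.0]))  # 0.0 at the global minimum
print(rastrigin([1.0, 1.0]))  # a nearby point with a higher value
print(rastrigin([0.5, 0.5]))  # near a local maximum of the cosine term
```

The cosine term is what creates the regular pattern of local minima around the integer grid points, while the quadratic term makes the origin the lowest of them.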

Let's get started with using the Rastrigin function in Python. 
You need to activate the Anaconda environment first:

abhi@ubuntu:~$ source activate universe
(universe) abhi@ubuntu:~$

Now switch to Python mode:

(universe) abhi@ubuntu:~$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

When you import Matplotlib for the first time, it builds its font cache if one has not
already been created, as shown in Figure 4-23.

Figure 4-23. Cache being created


The entire flow of the Python program is as follows. Note that in the interactive
interpreter, a statement typed after the if block must start at the left margin; typing it
with leading spaces raises IndentationError: unexpected indent.

(universe) abhi@ubuntu:~$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from matplotlib import cm
>>> from mpl_toolkits.mplot3d import Axes3D
/home/abhi/anaconda3/envs/universe/lib/python3.5/site-packages/matplotlib/
font_manager.py:280: UserWarning: Matplotlib is building the font cache
using fc-list. This may take a moment.
  'Matplotlib is building the font cache using fc-list. '
>>> import math
>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> def rastrigin(*X, **kwargs):
...     A = kwargs.get('A', 10)
...     # A * n term from the formula, where n is the number of coordinates
...     return A * len(X) + sum([(x**2 - A * np.cos(2 * math.pi * x)) for x in X])
...
>>> if __name__ == '__main__':
...     X = np.linspace(-4, 4, 200)
...     Y = np.linspace(-4, 4, 200)
...
>>> X, Y = np.meshgrid(X, Y)
>>> Z = rastrigin(X, Y, A=10)
>>> fig = plt.figure()
>>> # On newer Matplotlib releases, use fig.add_subplot(projection='3d') instead
>>> ax = fig.gca(projection='3d')
>>> ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.plasma,
...                 linewidth=0, antialiased=False)
<mpl_toolkits.mplot3d.art3d.Poly3DCollection object at 0x7f79cfc73780>
>>> plt.savefig('rastrigin.png')
>>>


If you go back to the folder, you can see that the rastrigin.png file was created, as
shown in Figure 4-24.

Figure 4-24. Rastrigin function PNG file being saved

The plot in rastrigin.png shows the many local minima of the function, as shown in
Figure 4-25. They make it very difficult to find the global optimum.

Figure 4-25. The Rastrigin function PNG file
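To make the difficulty concrete, the following sketch runs plain gradient descent on the one-dimensional Rastrigin function from a few starting points. It is only an illustration: the step size, the number of steps, and the starting points are arbitrary choices, and the helper names are hypothetical.

```python
import math

def rastrigin_1d(x, A=10.0):
    return A + x**2 - A * math.cos(2 * math.pi * x)

def rastrigin_1d_grad(x, A=10.0):
    # Derivative of the function above with respect to x
    return 2 * x + 2 * math.pi * A * math.sin(2 * math.pi * x)

def gradient_descent(x0, lr=0.001, steps=2000):
    x = x0
    for _ in range(steps):
        x -= lr * rastrigin_1d_grad(x)
    return x

for start in (0.3, 2.2, -3.4):
    x = gradient_descent(start)
    print(f"start={start:+.1f} -> x={x:+.4f}, f(x)={rastrigin_1d(x):.4f}")
```

Only the run that starts close to the origin reaches the global minimum; the others settle in the nearest local minimum, close to an integer value of x. This is why population-based methods such as swarm optimization are commonly tested on this function.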

Swarm Intelligence in Python 

This section looks at a program in Python that works with the concept of swarm 
intelligence. You will therefore get to know particle swarm optimization (PSO) within 
Python. You can achieve this with the help of a research toolkit known as PySwarms. 

PySwarms is a good tool to implement optimization algorithms with the PSO 
method, such as: 

• Star topology 

• Ring topology 

First, you need to install PySwarms. Get inside the terminal and activate the 
Anaconda environment using the following command. 

abhi@ubuntu:~$ source activate universe 
(universe) abhi@ubuntu:~$ 

The dependencies prior to installing PySwarms are as follows: 

numpy >= 1.13.0 
scipy >= 0.17.0 
matplotlib >= 1.3.1 
Now install PySwarms as follows:


(universe) abhi@ubuntu:~$ pip install pyswarms 


Now the process is complete. 

Figure 4-26 shows that PySwarms is completely installed. 


Figure 4-26. PySwarms is installed


Now we move to Python mode. 

(universe) abhi@ubuntu:~$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

First, you need to import the PySwarms utilities as follows:

>>> import pyswarms as ps

PySwarms provides several ready-made objective functions; to use them, you have to
import:

>>> from pyswarms.utils.functions import single_obj as fx
Next, you need to declare these hyperparameters:

>>> options = {'c1': 0.5, 'c2': 0.3, 'w': 0.9}

Here the swarm is configured through a dictionary: c1 is the cognitive parameter, c2 is
the social parameter, and w is the inertia weight.

In the next step, you create the instance of the optimizer by passing the dictionary
with the necessary arguments.

>>> optimizer = ps.single.GlobalBestPSO(n_particles=10, dimensions=2,
options=options)

After that, call the optimize method and store the optimal cost and position after
optimization. Figure 4-27 shows the results.


>>> cost, pos = optimizer.optimize(fx.sphere_func, print_step=100,
iters=1000, verbose=3)

Figure 4-27. Showing the result

After going through the results, you can see that the optimizer was able to find a good
minimum.
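The update rule behind a global best optimizer can be sketched in a few lines of NumPy. This is a simplified illustration, not the PySwarms implementation: the initialization range, the zero starting velocities, the seed, and the function names are all assumptions made for this sketch, while c1, c2, and w play the same roles as in the options dictionary.

```python
import numpy as np

def sphere(x):
    # x has shape (n_particles, dimensions); returns one cost per particle
    return np.sum(x**2, axis=1)

def global_best_pso(cost_fn, n_particles=10, dimensions=2,
                    c1=0.5, c2=0.3, w=0.9, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-1.0, 1.0, size=(n_particles, dimensions))
    vel = np.zeros_like(pos)
    # Personal bests start at the initial positions
    pbest_pos = pos.copy()
    pbest_cost = cost_fn(pos)
    g = np.argmin(pbest_cost)
    gbest_pos, gbest_cost = pbest_pos[g].copy(), pbest_cost[g]
    history = [gbest_cost]
    for _ in range(iters):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Inertia + pull toward personal best + pull toward global best
        vel = w * vel + c1 * r1 * (pbest_pos - pos) + c2 * r2 * (gbest_pos - pos)
        pos = pos + vel
        cost = cost_fn(pos)
        improved = cost < pbest_cost
        pbest_pos[improved] = pos[improved]
        pbest_cost[improved] = cost[improved]
        g = np.argmin(pbest_cost)
        if pbest_cost[g] < gbest_cost:
            gbest_pos, gbest_cost = pbest_pos[g].copy(), pbest_cost[g]
        history.append(gbest_cost)
    return gbest_cost, gbest_pos, history

cost, pos, history = global_best_pso(sphere)
print(cost, pos)
```

Because every particle is pulled toward the single best position found by the whole swarm, this corresponds to the star topology mentioned earlier.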

You will now do the same using the local best PSO. You similarly configure it by
declaring a dictionary, which now also includes k, the number of neighbors each particle
considers, and p, the Minkowski norm used to measure distance (1 for the L1 norm, 2 for
the L2 norm):

>>> options = {'c1': 0.5, 'c2': 0.3, 'w': 0.9, 'k': 2, 'p': 2}

Create the instance of the optimizer:

>>> optimizer = ps.single.LocalBestPSO(n_particles=10, dimensions=2,
options=options)

Now you call the optimize method to store the value as you did before.

By using the verbose argument, you can control the verbosity of the output, and
print_step sets how many steps pass between printed reports.

>>> cost, pos = optimizer.optimize(fx.sphere_func, print_step=50,
iters=1000, verbose=3)

The output is shown in Figure 4-28.

Figure 4-28. The output of the swarm optimization
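The difference between the local best and global best variants lies in which "best" position attracts each particle. The sketch below shows one way the neighborhood best could be computed from k and p. It is an assumption-laden illustration rather than PySwarms code: local_best is a hypothetical helper, and counting a particle as its own nearest neighbor is an assumption of this sketch.

```python
import numpy as np

def local_best(pos, cost, k=2, p=2):
    """For each particle, return the best position and cost among its k
    nearest particles (itself included), using the Minkowski p-norm."""
    # Pairwise differences, shape (n_particles, n_particles, dimensions)
    diff = pos[:, None, :] - pos[None, :, :]
    dist = np.linalg.norm(diff, ord=p, axis=2)
    # Indices of the k closest particles; a particle is at distance 0 from itself
    neighbours = np.argsort(dist, axis=1)[:, :k]
    best_in_group = np.argmin(cost[neighbours], axis=1)
    best_idx = neighbours[np.arange(len(pos)), best_in_group]
    return pos[best_idx], cost[best_idx]

pos = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
cost = np.sum(pos**2, axis=1)
best_pos, best_cost = local_best(pos, cost, k=2, p=2)
print(best_cost)
```

In this small example the two particles near (5, 5) follow their own neighborhood best instead of the overall best at the origin, which is what keeps a local best swarm exploring several regions for longer.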

Building a Game AI

We have already discussed the game AI with OpenAI Gym and environment simulation, 
but we take it further in this section. First, we will clone one of the most important and 
simplest examples of game AI, as shown in Figure 4-29. 
(universe) abhi@ubuntu:~$ git clone https://github.com/llSourcell/Game-AI.git

Figure 4-29. Cloning the repo


You first need to set up the environment. The requirements are as follows: 

• TensorFlow 

• OpenAI Gym 

• virtualenv 

• TFLearn 

There is one dependency to install: virtualenv. You install it using this
command:

conda install -c anaconda virtualenv

It will ask you whether you want to install the new virtualenv package, as shown in
Figure 4-30. Choose yes.

Figure 4-30. Getting the virtualenv package

When the package installation is successful and complete, you'll see the screen in
Figure 4-31.

Figure 4-31. Package installation is complete

Now you can install TFLearn using this command:

conda install -c derickl tflearn

When you attempt to install TFLearn, you may get this error, because the derickl
channel does not provide a build for linux-64:

conda install -c derickl tflearn 

Fetching package metadata . 

Solving package specifications: . 

PackageNotFoundError: Package not found: '' Package missing in current 
linux-64 channels: 

- tflearn 

You can search for packages on anaconda.org with 
anaconda search -t conda tflearn 
(universe) abhi@ubuntu:~$ anaconda search -t conda tflearn 
Using Anaconda API: https://api.anaconda.org 
Run 'anaconda show <USER/PACKAGE>' to get more details: 
Packages:

     Name                 | Version | Package Types | Platforms
     -------------------- | ------- | ------------- | ---------
     asherp/tflearn       | 0.2.2   | conda         | osx-64
     contango/tflearn     | 0.3.2   | conda         | linux-64
     derickl/tflearn      | 0.2.2   | conda         | osx-64

Found 3 packages

If this happens, be sure to install the one that's for linux-64: 

(universe) abhi@ubuntu:~$ anaconda show contango/tflearn 
Using Anaconda API: https://api.anaconda.org 
Name: tflearn 

Summary: 

Access: public 
Package Types: conda 
Versions:

+ 0.3.2

To install this package with Anaconda, run the following command:

conda install --channel https://conda.anaconda.org/contango tflearn

It will ask for installation of other packages, as shown in Figure 4-32.



Figure 4-32. Installation of other packages

Now import the relevant libraries using these commands:

(universe) abhi@ubuntu:~$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gym
>>> import random
>>> import numpy as np
>>> import tflearn
>>> from tflearn.layers.core import input_data, dropout, fully_connected
>>> from tflearn.layers.estimator import regression
>>> from statistics import median, mean
>>> from collections import Counter
>>> LR = 1e-3
>>> env = gym.make("CartPole-v0")
[2017-09-22 08:22:15,933] Making new env: CartPole-v0
>>> env.reset()
array([-0.03283849, -0.04877971,  0.0408221 , -0.01600674])


The Entire TFLearn Code 

To start with, you need to import the important libraries. TFLearn makes prototyping
fast, so the program can implement RL very quickly.

Next, set a learning rate and initialize the simulated environment. A random movement
is then chosen with the following command:

action = env.action_space.sample()

This example pairs each observation with the movement of the balanced cart-pole
(moving left or right). In the given problem, the basis of RL is the score that we are
referencing.
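The selection idea described here (play random games, then keep only the moves from games whose score clears a threshold) can be sketched without Gym. In the sketch below, ToyEnv is a hypothetical stand-in environment, not CartPole, and collect_training_data is an illustrative helper; only the overall pattern of pairing observations with actions and filtering by score is taken from the text.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for a Gym environment (not CartPole itself).
    The episode ends once the position drifts outside [-3, 3]."""
    def reset(self):
        self.position = 0
        return [self.position]

    def step(self, action):
        # action 0 moves left, action 1 moves right
        self.position += 1 if action == 1 else -1
        done = abs(self.position) > 3
        reward = 1.0  # one point for every step survived
        return [self.position], reward, done, {}

def collect_training_data(env, n_games, max_steps, score_requirement, seed=0):
    rng = random.Random(seed)
    training_data, accepted_scores = [], []
    for _ in range(n_games):
        observation = env.reset()
        game_memory, score = [], 0.0
        for _ in range(max_steps):
            action = rng.randrange(0, 2)
            # Pair the observation we saw with the action we took from it
            game_memory.append((observation, action))
            observation, reward, done, _info = env.step(action)
            score += reward
            if done:
                break
        # Keep only the games that met the score threshold
        if score >= score_requirement:
            accepted_scores.append(score)
            for obs, action in game_memory:
                one_hot = [1, 0] if action == 0 else [0, 1]
                training_data.append((obs, one_hot))
    return training_data, accepted_scores

data, scores = collect_training_data(ToyEnv(), n_games=200, max_steps=50,
                                     score_requirement=10)
print(len(scores), "games accepted,", len(data), "training samples")
```

Nothing in this loop tells the agent how to reach a high score; the filtering step alone decides which random behavior ends up in the training set.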

After applying the RL, we are training the model with TFLearn, a module for 
TensorFlow that's used to create a fully connected neural network and produce a faster 
training process. 

import gym 
import random 
import numpy as np 
import tflearn 

from tflearn.layers.core import input_data, dropout, fully_connected 

from tflearn.layers.estimator import regression 

from statistics import median, mean 

from collections import Counter 

LR = le-3 

env = gym.make("CartPole-vO") 
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env.resetO 
goal_steps = 500 
score_requirement = 50 
initial_games = lOOOO 
def some_random_games_first():
    # Each of these is its own game.
    for episode in range(5):
        env.reset()
        # this is each frame, up to 200...but we won't make it that far.
        for t in range(200):
            # This will display the environment.
            # Only display if you really want to see it.
            # Takes much longer to display it.
            env.render()

            # This will just create a sample action in any environment.
            # In this environment, the action can be 0 or 1, which is left or right.
            action = env.action_space.sample()

            # this executes the environment with an action,
            # and returns the observation of the environment,
            # the reward, if the env is over, and other info.
            observation, reward, done, info = env.step(action)
            if done:
                break

some_random_games_first()

def initial_population():
    # [OBS, MOVES]
    training_data = []
    # all scores:
    scores = []
    # just the scores that met our threshold:
    accepted_scores = []
    # iterate through however many games we want:
    for _ in range(initial_games):
        score = 0
        # moves specifically from this environment:
        game_memory = []
        # previous observation that we saw
        prev_observation = []
        # for each frame, up to goal_steps
        for _ in range(goal_steps):
            # choose random action (0 or 1)
            action = random.randrange(0, 2)
            # do it!
            observation, reward, done, info = env.step(action)

            # notice that the observation is returned FROM the action
            # so we'll store the previous observation here, pairing
            # the prev observation to the action we'll take.
            if len(prev_observation) > 0:
                game_memory.append([prev_observation, action])
            prev_observation = observation
            score += reward
            if done: break

        # IF our score is higher than our threshold, we'd like to save
        # every move we made
        # NOTE the reinforcement methodology here.
        # all we're doing is reinforcing the score, we're not trying
        # to influence the machine in any way as to HOW that score is
        # reached.
        if score >= score_requirement:
            accepted_scores.append(score)
            for data in game_memory:
                # convert to one-hot (this is the output layer for our neural network)
                if data[1] == 1:
                    output = [0, 1]
                elif data[1] == 0:
                    output = [1, 0]

                # saving our training data
                training_data.append([data[0], output])

        # reset env to play again
        env.reset()
        # save overall scores
        scores.append(score)

    # just in case you wanted to reference later
    # (dtype=object because each row pairs a 4-value observation with a 2-value label)
    training_data_save = np.array(training_data, dtype=object)
    np.save('saved.npy', training_data_save)

    # some stats here, to further illustrate the neural network magic!
    print('Average accepted score:', mean(accepted_scores))
    print('Median score for accepted scores:', median(accepted_scores))
    print(Counter(accepted_scores))

    return training_data

def neural_network_model(input_size):
    network = input_data(shape=[None, input_size, 1], name='input')

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 512, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 256, activation='relu')
    network = dropout(network, 0.8)

    network = fully_connected(network, 128, activation='relu')
    network = dropout(network, 0.8)

    # two outputs, one per action (left or right)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=LR,
                         loss='categorical_crossentropy', name='targets')
    model = tflearn.DNN(network, tensorboard_dir='log')
    return model

def train_model(training_data, model=False):
    # observations, reshaped to (samples, observation size, 1)
    X = np.array([i[0] for i in training_data]).reshape(-1, len(training_data[0][0]), 1)
    # one-hot encoded actions
    y = [i[1] for i in training_data]

    if not model:
        model = neural_network_model(input_size=len(X[0]))

    model.fit({'input': X}, {'targets': y}, n_epoch=5, snapshot_step=500,
              show_metric=True, run_id='openai_learning')
    return model

# collect training data from random games, then train on it
training_data = initial_population()
model = train_model(training_data)

scores = []
choices = []

for each_game in range(10):
    score = 0
    game_memory = []
    prev_obs = []
    env.reset()
    for _ in range(goal_steps):
        env.render()

        # first move is random; after that, ask the model
        if len(prev_obs) == 0:
            action = random.randrange(0, 2)
        else:
            action = np.argmax(model.predict(prev_obs.reshape(-1, len(prev_obs), 1))[0])

        choices.append(action)

        new_observation, reward, done, info = env.step(action)
        prev_obs = new_observation
        game_memory.append([new_observation, action])
        score += reward
        if done: break

    scores.append(score)

print('Average Score:', sum(scores)/len(scores))
print('choice 1:{} choice 0:{}'.format(choices.count(1)/len(choices),
                                       choices.count(0)/len(choices)))
print(score_requirement)
Here is the output:

Average Score: 195.9 

choice 1:0.5074017355793773 choice 0:0.49259826442062277 
50 

Solved. 
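For context, Gym's documentation treats CartPole-v0 as solved when the average reward is at least 195.0 over 100 consecutive episodes. The run above averages only 10 games, so its result is an indication rather than a formal pass. A minimal sketch of that check follows; is_solved is a hypothetical helper name, not a Gym function.

```python
from statistics import mean


def is_solved(scores, threshold=195.0, window=100):
    """Return True if the last `window` scores average at least `threshold`.

    Returns False when fewer than `window` scores are available.
    """
    if len(scores) < window:
        return False
    return mean(scores[-window:]) >= threshold


print(is_solved([200.0] * 10))    # too few episodes to judge
print(is_solved([196.0] * 100))   # enough episodes, average above threshold
```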


Conclusion 

This chapter touched on Q learning and then showed some examples. It also covered the 
MDP toolbox, swarm intelligence, and game AI, and ended with a full example. Chapter 5 
covers Reinforcement Learning with Keras, TensorFlow, and ChainerRL. 
CHAPTER 5


Reinforcement Learning 
with Keras, TensorFlow, 
and ChainerRL 


This chapter covers using Keras with Reinforcement Learning and defines how Keras can 
be used for Deep Q Learning as well. 


What Is Keras? 

Keras is an open source, high-level frontend library for neural networks. It supplies the
building blocks of a network, such as layers and activation functions, and delegates the
numerical work to a deep learning framework that runs as the backend.

Keras runs with several deep learning frameworks. The way to change from one
framework to another is to modify the keras.json file, which is located in the .keras
directory under your home directory (~/.keras/keras.json).

The backend parameter needs to change as follows: 


{ 

"backend" : "tensorflow" 

} 


You can change the parameter from TensorFlow to another framework if you want.
For example, to use Keras with Theano or CNTK, change the backend parameter in the
JSON file to "theano" or "cntk".
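Editing the file by hand is enough, but the change can also be scripted. The following is a minimal sketch using only the standard library; set_keras_backend is a hypothetical helper, and it is demonstrated on a throwaway file rather than the real ~/.keras/keras.json.

```python
import json
import os
import tempfile


def set_keras_backend(config_path, backend):
    """Rewrite the "backend" entry of a keras.json-style file and return the new config."""
    with open(config_path) as f:
        config = json.load(f)
    config["backend"] = backend
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config


# Demonstrate on a temporary copy so the real configuration is left untouched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "keras.json")
with open(demo_path, "w") as f:
    json.dump({"image_data_format": "channels_last", "epsilon": 1e-07,
               "floatx": "float32", "backend": "tensorflow"}, f)

updated = set_keras_backend(demo_path, "theano")
print(updated["backend"])
```

Keras also reads the KERAS_BACKEND environment variable, which overrides the value in the file for a single session.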

The structure of a keras.json file looks like this:


{
    "image_data_format": "channels_last",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}


© Abhishek Nandy and Manisha Biswas 2018 
A. Nandy and M. Biswas, Reinforcement Learning, 
https://doi.org/10.1007/978-1-4842-3285-9_5
The flow of all the Keras frameworks is shown in Figure 5-1.



Figure 5-1. Keras and its modification with different frameworks

Using Keras for Reinforcement Learning 

This section covers installing Keras and shows an example of Reinforcement Learning. 
You first need to install the dependencies. 

The dependencies are as follows: 

• Python 

• Keras 1.0 

• Pygame 

• Scikit-image 
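Before installing anything, it can help to see which of these dependencies are already present. The following is a minimal sketch using the standard library; missing_modules is a hypothetical helper, and skimage is the import name of Scikit-image.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for Keras, Pygame, and Scikit-image.
required = ["keras", "pygame", "skimage"]
print("Missing:", missing_modules(required))
```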

Let's start installing Keras 1.0. This example shows how to install Keras from the
Anaconda environment:

conda install -c jaikumarm keras
It asks for permission to install the new packages. Choose yes to proceed, as shown
in Figure 5-2.

[Terminal screenshot: conda install -c jaikumarm keras lists theano 0.9.0.dev4 (4.0 MB) and keras 2.0.8 (5.7 MB), 9.7 MB in total, as new packages to install and prompts "Proceed ([y]/n)?"]

Figure 5-2. The updates to be installed


When the package installation has completed successfully, you'll see the
information shown in Figure 5-3.

[Terminal screenshot: after answering y, conda fetches, extracts, and links the theano and keras packages; each stage reaches 100% and reports COMPLETE]

Figure 5-3. The package installation is complete


You can also install Keras in a different way. This example shows you how to
install it using pip3.

First, update the package lists with sudo apt-get update as follows:

(universe) abhi@ubuntu:~$ sudo apt-get update
Then install pip3 as follows:


sudo apt-get -y install python3-pip


Figure 5-4 shows the installation process.

[Terminal screenshot: apt-get installs four new packages (python-pip-whl, python3-pip, python3-setuptools, and python3-wheel), fetching 1,356 kB of archives]

Figure 5-4. Installing pip3


After the dependencies, you need to install Keras (see Figure 5-5):
(universe) abhi@ubuntu:~$ sudo pip3 install keras

[Terminal screenshot: pip3 downloads Keras 2.0.8 and pyyaml, finds six, numpy, and scipy already satisfied, and reports that keras and pyyaml were installed successfully. pip also warns that its cache directory is not owned by the current user and suggests sudo's -H flag.]

Figure 5-5. Installing Keras


Now check whether Keras uses the TensorFlow backend. From the terminal, in the
Anaconda environment you enabled first, switch to Python mode.

If you get the following result when importing Keras, everything is working
(see Figure 5-6).


(universe) abhi@ubuntu:~$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import keras
Using TensorFlow backend.


Figure 5-6. Keras with the TensorFlow backend
Using ChainerRL

This section covers ChainerRL and explains how to apply Reinforcement Learning using
it. ChainerRL is a deep Reinforcement Learning library built on top of the Chainer
framework. See Figure 5-7.

[Diagram: ChainerRL uses the Python language and implements deep Reinforcement Learning]

Figure 5-7. ChainerRL


Installing ChainerRL 

We will install ChainerRL first from the terminal window. Figure 5-8 shows the Anaconda
environment.

[Terminal screenshot: source activate universe switches the prompt to (universe) abhi@ubuntu:~$, and dir lists the contents of the home directory]

Figure 5-8. Activating the Anaconda environment


You can now install ChainerRL. To do so, type this command in the terminal: 


pip install chainerrl 


Figure 5-9 shows the result of the installation.



Figure 5-9. Installing ChainerRL
Now you can git clone the repo. Use this command to do so:

git clone https://github.com/chainer/chainerrl.git

Figure 5-10 shows the result.

[Terminal screenshot: pip reports that chainer and chainerrl were installed successfully, then git clone https://github.com/chainer/chainerrl.git clones the repository (5998 objects) into the chainerrl folder]

Figure 5-10. Cloning ChainerRL

Then get inside the chainerrl folder, as shown in Figure 5-11.

(universe) abhi@ubuntu:~$ cd chainerrl
(universe) abhi@ubuntu:~/chainerrl$ dir
assets     CONTRIBUTING.md  examples  README.md        requirements.txt  test_examples.sh  tools
chainerrl  docs             LICENSE   readthedocs.yml  setup.py          tests

Figure 5-11. Inside the chainerrl folder
Pipeline for Using ChainerRL

Since the library is based on Python, the obvious language of choice is Python. Follow 
these steps to set up ChainerRL: 

1. Import the gym, numpy, and supportive chainerrl libraries. 
import chainer 

import chainer.functions as F 
import chainer.links as L 
import chainerrl 
import gym 
import numpy as np 

You have to model an environment so that you can use OpenAI Gym (see Figure 5-12).
The environment has two spaces:

• Observation space

• Action space

It must also provide two methods, reset and step.
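To make the reset and step contract concrete, here is a minimal sketch of a toy environment written in plain Python. CoinFlipEnv is hypothetical and is not part of Gym or ChainerRL. It follows the older Gym convention used throughout this chapter, where step returns a four-item tuple of observation, reward, done, and info.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment with a Gym-style reset/step interface.

    The agent guesses a hidden coin (0 or 1) at each step and earns a reward
    of 1.0 for a correct guess. The episode ends after `max_steps` steps.
    """

    n_actions = 2  # size of the discrete action space

    def __init__(self, max_steps=5, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.steps = 0
        self.coin = self.rng.randrange(2)
        return [float(self.steps)]

    def step(self, action):
        """Apply `action` and return (observation, reward, done, info)."""
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randrange(2)
        done = self.steps >= self.max_steps
        return [float(self.steps)], reward, done, {}


toy_env = CoinFlipEnv()
obs = toy_env.reset()
done = False
total = 0.0
while not done:
    obs, reward, done, info = toy_env.step(0)  # always guess 0
    total += reward
print("steps:", toy_env.steps, "total reward:", total)
```

Any object that exposes these two methods with the same return shapes can be driven by the same interaction loop.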


[Diagram: ChainerRL interacts with the environment; Env.reset returns the initial state]

Figure 5-12. How ChainerRL uses state transitions
2. Take a simulation environment such as CartPole-v0 from the
OpenAI simulation environment.

env = gym.make('CartPole-v0')
print('observation space:', env.observation_space)
print('action space:', env.action_space)

obs = env.reset()
env.render()
print('initial observation:', obs)

action = env.action_space.sample()
obs, r, done, info = env.step(action)
print('next observation:', obs)
print('reward:', r)
print('done:', done)
print('info:', info)


3. Now define an agent that will learn from interactions with the
environment. Here, it's the QFunction(chainer.Chain) class:

class QFunction(chainer.Chain):

    def __init__(self, obs_size, n_actions, n_hidden_channels=50):
        super().__init__(
            l0=L.Linear(obs_size, n_hidden_channels),
            l1=L.Linear(n_hidden_channels, n_hidden_channels),
            l2=L.Linear(n_hidden_channels, n_actions))

    def __call__(self, x, test=False):
        """
        Args:
            x (ndarray or chainer.Variable): An observation
            test (bool): a flag indicating whether it is in test mode
        """
        h = F.tanh(self.l0(x))
        h = F.tanh(self.l1(h))
        return chainerrl.action_value.DiscreteActionValue(self.l2(h))

obs_size = env.observation_space.shape[0]
n_actions = env.action_space.n
q_func = QFunction(obs_size, n_actions)

Next we apply Q learning. We start with the agent:

# The agent below needs an optimizer, which the listing does not define.
# Adam with eps=1e-2 follows the ChainerRL quickstart.
optimizer = chainer.optimizers.Adam(eps=1e-2)
optimizer.setup(q_func)

# Discount factor for future rewards
gamma = 0.95

# Use epsilon-greedy for exploration
explorer = chainerrl.explorers.ConstantEpsilonGreedy(
    epsilon=0.3, random_action_func=env.action_space.sample)

# DQN uses Experience Replay.
# Specify a replay buffer and its capacity.
replay_buffer = chainerrl.replay_buffer.ReplayBuffer(capacity=10 ** 6)

# Since observations from CartPole-v0 are numpy.float64 while
# Chainer only accepts numpy.float32 by default, specify
# a converter as a feature extractor function phi.
phi = lambda x: x.astype(np.float32, copy=False)

# Now create an agent that will interact with the environment.
agent = chainerrl.agents.DoubleDQN(
    q_func, optimizer, replay_buffer, gamma, explorer,
    replay_start_size=500, update_interval=1,
    target_update_interval=100, phi=phi)
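Two ideas carry most of the weight in this step: epsilon-greedy exploration and the replay buffer. The following is a minimal standard-library sketch of both. The names epsilon_greedy and SimpleReplayBuffer are hypothetical and only mimic the behavior described here; they are not ChainerRL's implementations.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class SimpleReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Return `batch_size` transitions drawn uniformly without replacement."""
        return rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


rng = random.Random(0)
buffer = SimpleReplayBuffer(capacity=3)
for step in range(5):
    action = epsilon_greedy([0.1, 0.9], epsilon=0.3, rng=rng)
    buffer.append(step, action, 1.0, step + 1, False)

print(len(buffer))                               # the capacity caps the buffer at 3
print(epsilon_greedy([0.1, 0.9], epsilon=0.0))   # epsilon=0 is purely greedy
```

Sampling past transitions at random breaks the correlation between consecutive steps, which is the reason DQN-style agents train from a buffer instead of from the latest transition alone.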

4. Start the Reinforcement Learning process. You have to open
Jupyter Notebook first in the Universe environment, as
shown in Figure 5-13.

[Terminal screenshot: jupyter notebook starts the notebook server, which serves notebooks from /home/abhi on localhost]

Figure 5-13. Getting inside jupyter notebook


abhi@ubuntu:~$ source activate universe 
(universe) abhi@ubuntu:~$ jupyter notebook 
Figure 5-14 shows the code running in the notebook. The cell in the figure contains the
following training loop:

n_episodes = 200  # value not legible in the source figure; 200 is assumed here
max_episode_len = 200
for i in range(1, n_episodes + 1):
    obs = env.reset()
    reward = 0
    done = False
    R = 0  # return (sum of rewards)
    t = 0  # time step
    while not done and t < max_episode_len:
        # Uncomment to watch the behaviour
        # env.render()
        action = agent.act_and_train(obs, reward)
        obs, reward, done, _ = env.step(action)
        R += reward
        t += 1
    if i % 10 == 0:
        print('episode:', i,
              'R:', R,
              'statistics:', agent.get_statistics())
    agent.stop_episode_and_train(obs, reward, done)
print('Finished.')

The first line of output visible in the figure is:

episode: 10 R: 16.0 statistics: [('average_q', 0.61421256518621045), ...

Figure 5-14. Running the code



5. Now you test the agent, as shown in Figure 5-15.

# The top of this cell is cropped in the source figure. The loop header and
# the per-episode setup are reconstructed to mirror the training loop.
for i in range(10):
    obs = env.reset()
    done = False
    R = 0
    t = 0
    while not done and t < 200:
        action = agent.act(obs)
        obs, r, done, _ = env.step(action)
        R += r
        t += 1
    print('test episode:', i, 'R:', R)
    agent.stop_episode()

[Notebook output: one "test episode: i R: ..." line for each of the test episodes 0 through 9]

# Save an agent to the 'agent' directory
agent.save('agent')

# Uncomment to load an agent from the 'agent' directory
# agent.load('agent')


Figure 5-15. Testing the agents

We completed the entire program in the Jupyter notebook. Now we will work on one
of the repos for understanding Deep Q Learning with TensorFlow. See Figure 5-16.

[Terminal screenshot: the notebook server is stopped with Ctrl-C, and then the repository is cloned]

(universe) abhi@ubuntu:~$ git clone https://github.com/carpedm20/deep-rl-tensorflow.git

Figure 5-16. Cloning the GitHub repo
First you need to install the prerequisites as follows:
pip install -U 'gym[all]' tqdm scipy

Then get inside the deep-rl-tensorflow folder (see Figure 5-17):

(universe) abhi@ubuntu:~$ cd deep-rl-tensorflow
(universe) abhi@ubuntu:~/deep-rl-tensorflow$ dir
agents  environments  main.py   README.md  utils.py
assets  LICENSE       networks  test.sh

Figure 5-17. Getting inside the folder


Then run the program and train it without using GPU support, as shown in Figure 5-18.

[Terminal screenshot: python main.py is launched from the deep-rl-tensorflow folder with --use_gpu=False]

Figure 5-18. Training the program without GPU support
The command is as follows:

$ python main.py --network_header_type=nips --env_name=Breakout-v0 --use_gpu=False

The command uses the main.py Python file and runs the Breakout game simulation
in CPU mode only. You can now open the terminal to get inside the Anaconda
environment, as shown in Figure 5-19.

abhi@ubuntu:~$ source activate universe
(universe) abhi@ubuntu:~$

Figure 5-19. Activating the environment


Now switch to Python mode, as shown in Figure 5-20:

(universe) abhi@ubuntu:~$ python
Python 3.5.3 |Anaconda custom (64-bit)| (default, Mar 6 2017, 11:58:13)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Figure 5-20. Switching to Python mode


As you switch to Python mode, you first import the utilities:

import gym
import numpy as np

To record what is learned in the frozen lake simulation, you formulate the Q table,
with one row per state and one column per action, as follows:

Q = np.zeros([env.observation_space.n,env.action_space.n])

After that, you declare the learning parameters and create the lists to contain the
rewards for each episode.

import gym
import numpy as np

env = gym.make('FrozenLake-v0')

# Initialize table with all zeros
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Set learning parameters
lr = .8
y = .95
num_episodes = 2000

# Create lists to contain total rewards and steps per episode
#jList = []
rList = []

for i in range(num_episodes):
    # Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    # The Q-Table learning algorithm
    while j < 99:
        j += 1
        # Choose an action by greedily (with noise) picking from Q table
        a = np.argmax(Q[s,:] + np.random.randn(1, env.action_space.n) * (1./(i+1)))
        # Get new state and reward from environment
        s1, r, d, _ = env.step(a)
        # Update Q-Table with new knowledge
        Q[s,a] = Q[s,a] + lr*(r + y*np.max(Q[s1,:]) - Q[s,a])
        rAll += r
        s = s1
        if d == True:
            break
    #jList.append(j)
    rList.append(rAll)

print("Score over time: " + str(sum(rList)/num_episodes))
print("Final Q-Table Values")
print(Q)

After going through all the steps, you can finally print the Q table. Each line should
be entered in Python mode.
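The same update rule can be tried without Gym or NumPy. The following is a minimal sketch on a made-up five-state corridor, where moving right from state 3 reaches the goal and earns a reward of 1. The environment, the epsilon-greedy exploration with random tie-breaking, and all names here are assumptions for illustration; only the update line matches the listing above.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic corridor: return (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def greedy(q_row, rng):
    """Return an action with the highest value, breaking ties at random."""
    best = max(q_row)
    return rng.choice([a for a in ACTIONS if q_row[a] == best])


def train(num_episodes=500, lr=0.8, y=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for episode in range(num_episodes):
        s = 0
        for t in range(100):  # cap on steps per episode
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = greedy(Q[s], rng)
            s1, r, d = step(s, a)
            # Same update rule as the FrozenLake listing
            Q[s][a] = Q[s][a] + lr * (r + y * max(Q[s1]) - Q[s][a])
            s = s1
            if d:
                break
    return Q


Q = train()
# Greedy action in every non-goal state after training
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Because the reward is discounted by y at every step, values shrink with distance from the goal, which is what makes "move right" the preferred action in each state.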


Deep Q Learning: Using Keras and TensorFlow

We will touch on Deep Q Learning with Keras. We will clone an important reinforcement
learning library, which is known as Keras-rl. It implements several state-of-the-art
Deep Q Learning algorithms. See Figure 5-21.

[Diagram: Keras-rl provides state-of-the-art Deep Q Learning algorithms]

Figure 5-21. Keras-rl representation


Installing Keras-rl 

The command for installing Keras-rl is as follows (see Figure 5-22): 
pip install keras-rl 
[Terminal screenshot: pip builds wheels for keras-rl and keras, uninstalls the existing Keras 2.0.8, and reports "Successfully installed keras-2.0.6 keras-rl-0.3.1"]

Figure 5-22. Installing Keras-rl


You also need to install h5py if it is not already installed, and then you need to clone
the repo, as shown in Figure 5-23.

(universe) abhi@ubuntu:~$ pip install h5py
(universe) abhi@ubuntu:~$ git clone https://github.com/matthiasplappert/keras-rl

Figure 5-23. Cloning the git repo

Training with Keras-rl

This section shows how to run one of the bundled example programs. First, change into
the keras-rl directory and then into its examples folder, as shown in Figure 5-24.

abhi@ubuntu:~$ cd keras-rl 
abhi@ubuntu:~/keras-rl$ dir 

assets examples LICENSE pytest.ini rl setup.py 

docs ISSUE_TEMPLATE.md mkdocs.yml README.md setup.cfg tests 

abhi@ubuntu:~/keras-rl$ cd examples 
abhi@ubuntu:~/keras-rl/examples$ dir 

cem_cartpole.py dqn_atari.py duel_dqn_cartpole.py sarsa_cartpole.py 
ddpg_pendulum.py dqn_cartpole.py naf_pendulum.py visualize_log.py 
abhi@ubuntu:~/keras-rl/examples$ 



Figure 5-24. Getting inside the Keras-rl directory


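Now you can run one of the examples:

abhi@ubuntu:~/keras-rl/examples$ python dqn_cartpole.py

If this fails because the gym module cannot be found, activate the Anaconda
environment and run the example again:

abhi@ubuntu:~/keras-rl/examples$ source activate universe
(universe) abhi@ubuntu:~/keras-rl/examples$ python dqn_cartpole.py

See Figure 5-25.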

abhi@ubuntu:~/keras-rl/examples$ python dqn_cartpole.py
Traceback (most recent call last):
  File "dqn_cartpole.py", line 2, in <module>
    import gym
ImportError: No module named gym
abhi@ubuntu:~/keras-rl/examples$ source activate universe
(universe) abhi@ubuntu:~/keras-rl/examples$ python dqn_cartpole.py
Using TensorFlow backend.


Figure 5-25. Using the TensorFlow backend

The simulation will now begin, as shown in Figure 5-26.


Figure 5-26. Simulation happens

The simulation runs and trains the model using Deep Q Learning. With practice,
the cart learns to keep the pole balanced as it moves along the track; its stability
increases with learning.

The entire process creates the following log:

(universe) abhi@ubuntu:~/keras-rl/examples$ python dqn_cartpole.py 
Using TensorFlow backend. 

[2017-09-24 09:36:27,476] Making new env: CartPole-v0

Layer (type)                 Output Shape              Param #
flatten_1 (Flatten)          (None, 4)                 0
dense_1 (Dense)              (None, 16)                80
activation_1 (Activation)    (None, 16)                0
dense_2 (Dense)              (None, 16)                272
activation_2 (Activation)    (None, 16)                0
dense_3 (Dense)              (None, 16)                272
activation_3 (Activation)    (None, 16)                0
dense_4 (Dense)              (None, 2)                 34
activation_4 (Activation)    (None, 2)                 0

Total params: 658
Trainable params: 658
Non-trainable params: 0

None

2017-09-24 09:36:27.932219: W tensorflow/core/platform/cpu_feature_guard.
cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but
these are available on your machine and could speed up CPU computations.

...

712/50000: episode: 38, duration: 0.243s, episode steps: 14, steps per
second: 58, episode reward: 14.000, mean reward: 1.000 [1.000, 1.000], mean
action: 0.500 [0.000, 1.000], mean observation: 0.105 [-0.568, 0.957], loss:
0.291389, mean_absolute_error: 3.054634, mean_q: 5.816398

Each episode is one complete run of the simulation. The dqn_cartpole.py code is
discussed next. You first need to import the utilities. The utilities included with
Keras-rl are very useful, as they have built-in agents for applying Deep Q Learning.
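
To make the idea behind such an agent more concrete, here is a minimal sketch of the one-step target value that Deep Q Learning trains the network toward. It is not the Keras-rl implementation; the function name q_learning_target and the discount factor gamma=0.99 are assumptions chosen only for illustration.

```python
def q_learning_target(reward, next_q_values, done, gamma=0.99):
    """Return the one-step Q-learning target for a single transition.

    reward        -- reward received after taking the action
    next_q_values -- estimated Q values for every action in the next state
    done          -- True if the episode ended with this transition
    gamma         -- discount factor (0.99 is an illustrative assumption)
    """
    if done:
        # No future reward can follow a terminal state.
        return reward
    # Bootstrap from the best action available in the next state.
    return reward + gamma * max(next_q_values)


# A non-terminal step with reward 1.0 and two possible next actions.
target = q_learning_target(1.0, [5.0, 6.0], done=False)
# The same step when it ends the episode.
terminal_target = q_learning_target(1.0, [5.0, 6.0], done=True)
print(target, terminal_target)
```

The network's prediction for the action actually taken is nudged toward this target, which is why the log reports a loss and a mean Q value for each episode.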

First, declare the environment as follows:

ENV_NAME = 'CartPole-v0'
env = gym.make(ENV_NAME)

Since we want to implement Deep Q Learning, we define a neural network that
approximates the Q function. This example uses a small fully connected network
rather than a convolutional one, with ReLU activations between the layers and a
linear output layer that produces one Q value per action. We build it as a Keras
Sequential model.

model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))

You can print the model details too, as follows: 

print(model.summary()) 
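
The parameter counts that the summary reports (they appear in the training log shown earlier) can be checked by hand. The short sketch below does that arithmetic; the helper name dense_params is hypothetical, and the layer widths are taken from the log: 4 observation values in, three hidden layers of 16 units, and 2 actions out.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


# Layer widths of the CartPole model: 4 observations in, 2 actions out.
layer_sizes = [4, 16, 16, 16, 2]

# Pair each layer's input width with its output width.
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [80, 272, 272, 34] 658
```

The Flatten and Activation layers contribute no parameters, which is why they show 0 in the summary.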

Next, configure the agent: choose the replay memory, the exploration policy, and the
optimizer, and then compile and train it. The complete program is as follows:

import numpy as np
import gym

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
from keras.optimizers import Adam

from rl.agents.dqn import DQNAgent
from rl.policy import BoltzmannQPolicy
from rl.memory import SequentialMemory

ENV_NAME = 'CartPole-v0'

# Get the environment and extract the number of actions.
env = gym.make(ENV_NAME)
np.random.seed(123)
env.seed(123)
nb_actions = env.action_space.n

# Next, we build a very simple model.
model = Sequential()
model.add(Flatten(input_shape=(1,) + env.observation_space.shape))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(nb_actions))
model.add(Activation('linear'))
print(model.summary())

# Finally, we configure and compile our agent. You can use every built-in
# Keras optimizer and even the metrics!
memory = SequentialMemory(limit=50000, window_length=1)
policy = BoltzmannQPolicy()
dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory,
               nb_steps_warmup=10, target_model_update=1e-2, policy=policy)
dqn.compile(Adam(lr=1e-3), metrics=['mae'])

# Okay, now it's time to learn something! We visualize the training here for
# show, but this slows down training quite a lot. You can always safely abort
# the training prematurely using Ctrl + C.
dqn.fit(env, nb_steps=50000, visualize=True, verbose=2)

# After training is done, we save the final weights.
dqn.save_weights('dqn_{}_weights.h5f'.format(ENV_NAME), overwrite=True)

# Finally, evaluate our algorithm for 5 episodes.
dqn.test(env, nb_episodes=5, visualize=True)
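
Two of the options passed to the agent are worth a closer look. The sketch below is plain Python, not Keras-rl code, and illustrates the ideas behind them under stated assumptions: a Boltzmann policy samples actions with probability proportional to exp(Q / tau), and a target_model_update value below 1 is assumed to act as a soft-update rate that moves the target network a small step toward the trained network. The function names boltzmann_action and soft_update, and the temperature tau=1.0, are hypothetical.

```python
import math
import random


def boltzmann_action(q_values, tau=1.0, rng=random):
    """Sample an action index with probability proportional to exp(q / tau)."""
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one action index according to those probabilities.
    action = rng.choices(range(len(q_values)), weights=probs, k=1)[0]
    return action, probs


def soft_update(target_weights, online_weights, rate=1e-2):
    """Move each target weight a small step toward the online weight."""
    return [(1.0 - rate) * t + rate * w
            for t, w in zip(target_weights, online_weights)]


rng = random.Random(123)
action, probs = boltzmann_action([1.0, 2.0], tau=1.0, rng=rng)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], rate=1e-2)
print(action, probs, new_target)
```

Actions with higher Q values are chosen more often, but the lower-valued action still gets explored occasionally. The soft update keeps the target values changing slowly, which tends to make training more stable.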

To get all the capabilities of Keras-rl, you need to run the setup.py file within the
Keras-rl folder, as follows:

(universe) abhi@ubuntu:~/keras-rl$ python setup.py install

The output shows the package being built and installed step by step:

running install
running bdist_egg
running egg_info
creating keras_rl.egg-info
writing requirements to keras_rl.egg-info/requires.txt
writing dependency_links to keras_rl.egg-info/dependency_links.txt
writing top-level names to keras_rl.egg-info/top_level.txt
writing keras_rl.egg-info/PKG-INFO
writing manifest file 'keras_rl.egg-info/SOURCES.txt'
reading manifest file 'keras_rl.egg-info/SOURCES.txt'
writing manifest file 'keras_rl.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/tests
copying tests/__init__.py -> build/lib/tests
creating build/lib/rl
copying rl/util.py -> build/lib/rl
copying rl/callbacks.py -> build/lib/rl
copying rl/keras_future.py -> build/lib/rl
copying rl/memory.py -> build/lib/rl
copying rl/random.py -> build/lib/rl
copying rl/core.py -> build/lib/rl
copying rl/__init__.py -> build/lib/rl
copying rl/policy.py -> build/lib/rl
creating build/lib/tests/rl
copying tests/rl/test_util.py -> build/lib/tests/rl
copying tests/rl/util.py -> build/lib/tests/rl
copying tests/rl/test_memory.py -> build/lib/tests/rl
copying tests/rl/test_core.py -> build/lib/tests/rl
copying tests/rl/__init__.py -> build/lib/tests/rl
creating build/lib/tests/rl/agents
copying tests/rl/agents/test_cem.py -> build/lib/tests/rl/agents
copying tests/rl/agents/__init__.py -> build/lib/tests/rl/agents
copying tests/rl/agents/test_ddpg.py -> build/lib/tests/rl/agents
copying tests/rl/agents/test_dqn.py -> build/lib/tests/rl/agents
creating build/lib/rl/agents
copying rl/agents/sarsa.py -> build/lib/rl/agents
copying rl/agents/ddpg.py -> build/lib/rl/agents
copying rl/agents/dqn.py -> build/lib/rl/agents
copying rl/agents/cem.py -> build/lib/rl/agents
copying rl/agents/__init__.py -> build/lib/rl/agents

Keras-rl is now set up and you can use the built-in functions to their fullest effect. 


Conclusion 

This chapter introduced and defined Keras and explained how to use it with 
Reinforcement Learning. The chapter also explained how to use TensorFlow with 
Reinforcement Learning and discussed using ChainerRL. Chapter 6 covers Google 
DeepMind and the future of Reinforcement Learning. 

CHAPTER 6


Google’s DeepMind and the 
Future of Reinforcement 
Learning 


This chapter discusses Google DeepMind and Google AlphaGo and then moves on to 
the future of Reinforcement Learning and compares what's happening with man versus 
machine. 


Google DeepMind 

Google DeepMind (see Figure 6-1) was formed to take AI to the next level. Its aim is
to research and develop programs that can solve complex problems without being
taught the steps for doing so.



Figure 6-1. Google DeepMind logo

The DeepMind web site is at https://deepmind.com/.

This web site (see Figure 6-2) describes the team's current and future work. Its
publications and research are also available on the site.



Figure 6-2. The DeepMind web site


You will see that the web site has lots of topics to search and discover. 


Google AlphaGo

This section takes a look at AlphaGo (see Figure 6-3), which is one of the best solutions
from the Google DeepMind team.



Figure 6-3. The Google AlphaGo logo

What Is AlphaGo?

AlphaGo is the Google program that plays the game Go, which is a traditional abstract 
strategy board game for two players. The object of the game is to occupy more territory 
than your opponent. Figure 6-4 shows the Go game board. 



Figure 6-4. The Go board (Image courtesy of Jaro Larnos, https://www.flickr.com/
photos/jlarnos/, used under a CC-BY 2.0 license)


Despite its simple rules, Go has more possible board positions than there are atoms
in the observable universe!
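
The scale of that claim can be checked with a quick calculation. The sketch below computes a simple upper bound on the number of board configurations, assuming the standard 19 x 19 board where each point is empty, black, or white. The figure of roughly 10^80 atoms is a commonly quoted order-of-magnitude estimate and is treated here as an assumption, not a measured value.

```python
# Each of the 19 x 19 = 361 points is empty, black, or white, so 3**361
# is an upper bound on board configurations (not all of them are legal).
BOARD_POINTS = 19 * 19
upper_bound = 3 ** BOARD_POINTS

# Rough, commonly quoted order of magnitude; treated here as an assumption.
ATOMS_ESTIMATE = 10 ** 80

# Python integers have arbitrary precision, so the digits can be counted.
digits = len(str(upper_bound))
print(digits, upper_bound > ATOMS_ESTIMATE)  # 173 True
```

Even this crude bound has 173 digits, compared with 81 digits for the atom estimate, which is why brute-force search is not an option for Go.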

The concept of the Go game and the mathematical ideas underlying it are
illustrated in Figure 6-5.




[Diagram: Go -> Simple Moves -> So many possibilities -> Combinatorial Game Theory]


Figure 6-5. Concept of the Go game

AlphaGo is the first computer program to defeat a professional human Go player, the
first program to defeat a Go world champion, and arguably the best Go player in history.

Figure 6-6 illustrates the AlphaGo approach.


Figure 6-6. Deep Q approach


Monte Carlo Search 

Monte Carlo Search (MCS) is a tree search technique that estimates the value of
states by sampling. It moves through the game tree in a distinctive sequence of steps.

MCS first selects the states to move through by following the declared policy.
Beyond a certain depth, the policy no longer specifies which state to move to. MCS then
expands from that state to the possible actions and simulates the rest of the path by
choosing actions at random. In this way, you use MCS-based simulation to sample the
reachable states and collect rewards. When you simulate a random path, you also obtain
Q values for the transitions from one state to another along that path. From the Q
values received, you can back up the information and move toward the top of the tree.
The entire process is shown in Figure 6-7.
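
The simulate-and-back-up idea can be sketched in a few lines. The example below is a deliberately simplified, flat form of Monte Carlo Search, not AlphaGo's tree search: it uses a toy game (players alternately take 1 to 3 stones, and whoever takes the last stone wins) chosen here only for illustration, runs random playouts after each possible first move, and backs up the average result as that move's Q value. All function names are hypothetical.

```python
import random


def legal_moves(stones):
    """A player may take 1, 2, or 3 stones, but no more than remain."""
    return [n for n in (1, 2, 3) if n <= stones]


def random_playout(stones, my_turn, rng):
    """Play random moves to the end; return 1.0 if we take the last stone."""
    while True:
        stones -= rng.choice(legal_moves(stones))
        if stones == 0:
            # Whoever just moved took the last stone and wins.
            return 1.0 if my_turn else 0.0
        my_turn = not my_turn


def monte_carlo_search(stones, n_playouts=2000, seed=0):
    """Estimate a Q value for each first move by averaging random playouts."""
    rng = random.Random(seed)
    q_values = {}
    for move in legal_moves(stones):
        remaining = stones - move
        if remaining == 0:
            q_values[move] = 1.0  # taking the last stone wins immediately
            continue
        # The opponent moves next, so the playout starts on their turn.
        wins = sum(random_playout(remaining, False, rng)
                   for _ in range(n_playouts))
        q_values[move] = wins / n_playouts  # back up the average reward
    best = max(q_values, key=q_values.get)
    return best, q_values


best_move, estimates = monte_carlo_search(5)
print(best_move, estimates)
```

A full tree search would additionally reuse these statistics at deeper nodes and steer the selection step with a policy, instead of sampling every playout uniformly at random.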

Figure 6-7. The Monte Carlo Search tree process

AlphaGo relies on two components: a tree search procedure and convolutional
networks that guide the tree search procedure.

In total, three convolutional networks of two different kinds are trained: two policy 
networks and one value network. 

Man vs. Machines

With the advent of Reinforcement Learning, many more jobs are being automated,
and many low-level jobs are already being done by machines.

The focus now is on how Reinforcement Learning can solve different problems and
improve well-being around the world.

For example, Reinforcement Learning can be used in the healthcare field. Instead of
using the same age-old tools for body scans, we can train robots and medical equipment
to scan body parts for different diagnostic purposes much more quickly and with greater
accuracy. With repeated training, decisions to perform more complex measurements and
scans can be left to the machines too.


Positive Aspects of AI 

Cognitive modeling applies when we gather information and resources from which
the system learns. This is called the cognitive way. Technological singularity is
achieved by enhancing cognitive modeling devices so that they interact and achieve
more unified goals.

A good strong AI solution is selfless and places the interest of others above all else.
A good AI solution always works for the team. By adding human empathy, as seen with
brainwaves, we can create good AI solutions that appear to be compassionate.

Applying a topological view to the world of AI helps streamline activities and allows 
each topology to master a specific, unique task. 


Negative Aspects of AI 

There can be negative aspects too. For example, what if a machine learns so fast that 
it starts talking to other machines and creates an AI of its own? In that case, it would 
be difficult for humans to predict the end game. We need to take these scenarios into 
consideration. Perhaps every AI solution needs a secret killswitch, as illustrated in 
Figure 6-8. 

Figure 6-8. Insert a killswitch just in case

Here are the steps to this basic process: 

1. We start a program. 

2. We apply Machine Learning to it. 

3. The program learns very quickly. 

4. We have to incorporate a killswitch into the process so that we 
can allow the program to be rolled back if necessary. 

5. When we see an anomaly or any abrupt behavior, we call the 
killswitch to roll the program back to the start. 
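
The steps above can be sketched as a simple snapshot-and-rollback pattern. This is only an illustration of the idea, not an established safety mechanism: the class name RollbackGuard, the single "weight" value standing in for a learned model, and the anomaly threshold of 10.0 are all hypothetical.

```python
import copy


class RollbackGuard:
    """Keep a snapshot of the starting state and restore it on demand."""

    def __init__(self, state):
        self.state = state
        self._initial = copy.deepcopy(state)  # step 4: the rollback point
        self.tripped = False

    def kill(self):
        """Step 5: roll the program back to the start."""
        self.state = copy.deepcopy(self._initial)
        self.tripped = True


def train(guard, updates, limit=10.0):
    """Apply updates until one pushes the weight past the anomaly limit."""
    for delta in updates:
        guard.state["weight"] += delta  # steps 2 and 3: the program learns
        if abs(guard.state["weight"]) > limit:  # hypothetical anomaly check
            guard.kill()
            break
    return guard.state


guard = RollbackGuard({"weight": 0.0})  # step 1: start the program
final_state = train(guard, [1.0, 2.0, 50.0, 1.0])
print(final_state, guard.tripped)  # {'weight': 0.0} True
```

The third update is abrupt enough to trip the check, so the state returns to its starting snapshot and the remaining update is never applied.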

There is a good chance that machines may learn this way, especially if they work 
in tandem. At some transition point, they might start interacting in a way that creates 
an AI of their own. We have to be able to avoid collisions of two or more Reinforcement 
Learning programs during the transition phase. 

Conclusion

We touched on a lot of concepts in this book, especially related to Reinforcement 
Learning. The book is an overview of how Reinforcement Learning works and the ideas 
you need to understand to get started. 

• We simplified the RL concepts with the help of the Python 
programming language. 

• We introduced OpenAI Gym and OpenAI Universe. 

• We introduced a lot of algorithms and touched on Keras and 
TensorFlow. 

We hope you have liked the book. Thanks again! 